This is a problem I have seen now in two different environments, at two different companies. Both happened to be using VMware Data Recovery for backups.
The problem starts like this: you lose a host from vCenter, and you cannot get it to reconnect. You run /sbin/services.sh restart on the host to restart the management agents, and you still cannot reconnect to vCenter.
You CAN connect to the host locally using the vSphere Client. Let’s look at the logs now.
This particular problem shows up in the hostd log. To see it, SSH into the host and type in: tail -f /var/log/hostd.log and then go into vCenter and right-click on the host to Connect.
While watching hostd.log, if you see messages about snapshots during the roughly five minutes the connection attempt takes to time out, here's how to confirm you have this issue.
In your SSH session on the affected host, type in the following:
find /vmfs/volumes/*/* -name "*delta*"
You'll see a list of the delta (snapshot) files for all VMs running on this host. If you see a VM with a couple hundred snapshots, that is why your host won't connect to vCenter. vCenter has a database limitation, and when a VM has more snapshots than vCenter can catalog in the database, the host cannot be managed by vCenter. I haven't figured out the exact limit for vCenter. A VM can have 496, according to this post by William Lam, but I think vCenter breaks before you get to that point. The suspect VM in my case had 235.
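To spot the offending VM at a glance, the output of that find can be summarized per VM directory with a quick pipeline. This is just a sketch: it's demonstrated here against a mock directory tree under /tmp (the datastore and VM names are stand-ins), but on the host you would point the find at /vmfs/volumes/*/* as above.

```shell
# Mock datastore layout, standing in for /vmfs/volumes/<datastore>/<vm>.
base=$(mktemp -d)
mkdir -p "$base/datastore1/vm-a" "$base/datastore1/vm-b"
for i in 1 2 3; do touch "$base/datastore1/vm-a/vm-a-00000${i}-delta.vmdk"; done
touch "$base/datastore1/vm-b/vm-b-000001-delta.vmdk"

# Count delta files per VM directory, busiest first.
# On a real host: find /vmfs/volumes/*/* -name "*delta*" | awk ...
find "$base"/*/* -name "*delta*" \
  | awk -F/ '{print $(NF-1)}' \
  | sort | uniq -c | sort -rn
```

On an affected host, the VM whose count is up in the hundreds is the one blocking the reconnect.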
To fix this, connect locally to the host with the vSphere Client and consolidate the VM's snapshots.
Once you've consolidated, the delta files should be gone from the VM's directory.
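A quick way to confirm the consolidation actually finished is to re-run the same find and check that it comes back empty. A sketch, again against a mock directory so the behavior is easy to see (substitute /vmfs/volumes/*/* on a real host):

```shell
# Mock VM directory after consolidation: base disk and flat file remain,
# no delta files. Stands in for the VM's folder on the datastore.
base=$(mktemp -d)
mkdir -p "$base/datastore1/vm-a"
touch "$base/datastore1/vm-a/vm-a.vmdk" "$base/datastore1/vm-a/vm-a-flat.vmdk"

# After consolidation this should print 0:
find "$base"/*/* -name "*delta*" | wc -l
```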
Now, you can connect back to vCenter with no problem and no downtime!
Since this is a development environment, we didn't pay much attention to VDR and just assumed it was working. This particular VM happened to be out of hard drive space, so it could not be quiesced, and VDR just kept trying, leaving another snapshot behind each time. The bottom line is, pay attention to VDR errors!!! After this, we'll be checking it at least every few days.