Guest NICs disconnected after upgrade

We are upgrading our infrastructure to ESXi 4.1 and hit an unexpected result in the Development cluster where multiple VMs were suddenly disconnected after vMotion. It sounded a lot like a problem I had seen before: if a vSwitch has too few available ports, the VMs that vMotion over are unable to connect to the switch. You generally avoid this with host profiles, but it’s possible a host in the Dev cluster fell out of sync. In any event, the server being upgraded this evening had already been rebuilt, so it wasn’t worth trying to figure out what the old configuration might have been. I needed to go through, find all the VMs that should have been connected but weren’t, and reconnect them. I decided that I needed:

  • VMs that were currently Powered On – obviously as Powered Off VMs are all disconnected
  • VMs with NICs currently set to “Connect at Power On” so I could avoid connecting something that an admin had intentionally left disconnected
  • VMs with NICs currently not connected

Note that this script will change network settings and REBOOT VMs if you execute it. I was watching the script while it executed: it pings the guest DNS name first to ensure the IP isn’t already on the network, then connects the NIC, then pings again to make sure the guest is back on the network. I figured I could Ctrl-C to stop if something looked wrong. I rebooted all of the guests to avoid any failed service / failed task problems that might have occurred while they were disconnected.

$vms = Get-Cluster "Development" | Get-VM | Where-Object { $_.PowerState -eq "PoweredOn" } | Sort-Object Name
foreach ($vm in $vms)
{
    # Only NICs that should connect at power on but are currently disconnected
    $nics = $vm | Get-NetworkAdapter | Where-Object { $_.ConnectionState.Connected -eq $false -and $_.ConnectionState.StartConnected -eq $true }
    if ($nics -ne $null)
    {
        foreach ($nic in $nics)
        {
            Write-Host $vm.Name
            Write-Host $nic
            # Check that the guest's IP isn't already answering before reconnecting
            ping $vm.Guest.HostName -n 5
            $nic | Set-NetworkAdapter -Connected $true -Confirm:$false
        }

        # Confirm the guest is back on the network, then reboot it to clear
        # anything that failed while it was disconnected
        ping $vm.Guest.HostName -n 5
        $vm | Restart-VMGuest
    }
}

vSphere Datastore Last Updated timestamp

We encountered issues that mysteriously appeared after a patching window on our vCenter server. We had just upgraded to vCenter 4.1 earlier in the month and this was the first time the server had been rebooted since the upgrade.

After reboot, the Last Updated timestamp on all datastores was stuck at the time the vCenter service came online. None of our disk space alarms worked because the datastore statistics were not being updated.

We noticed that the problem only appeared on the 4.0 U2 hosts – datastores connected to 4.1 clusters had no problem.

VMware support acknowledged the timestamp update issue as a problem in 4.0 U2 that was partially addressed in 4.1 and fully addressed in the soon-to-be-released 4.1 U1.
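Until the fix arrives, one way to spot-check free space without trusting the stale statistics is a quick PowerCLI pass that forces a storage refresh through the API before reading the numbers. This is just a sketch of the idea, not something from the support case:

# Ask each datastore to refresh its capacity / free-space figures, then report them.
# RefreshDatastoreStorageInfo() explicitly refreshes the free-space and capacity
# values in the datastore's summary.
Get-Datastore | ForEach-Object { ($_ | Get-View).RefreshDatastoreStorageInfo() }
Get-Datastore | Sort-Object FreeSpaceMB |
    Select-Object Name, CapacityMB, FreeSpaceMB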

Alarm problem after vCenter 4.1 upgrade

UPDATE 12/23/2010
VMware support confirms that there is a bug related to the vCenter 4.1 upgrade; it appears to be specific to datastore alarms. The workaround was to go through and disable, then re-enable, all datastore alarms. At least it was better than having to delete and recreate them.
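If you have a lot of alarm definitions, the toggle can be scripted. This is only a sketch – it assumes the Get-AlarmDefinition / Set-AlarmDefinition cmdlets from newer PowerCLI builds, and the wildcard is a guess at how your datastore alarms are named:

# Bulk disable, then re-enable, every alarm whose name mentions "datastore".
# Adjust the -like filter to match your own alarm names.
$alarms = Get-AlarmDefinition | Where-Object { $_.Name -like "*datastore*" }
$alarms | Set-AlarmDefinition -Enabled:$false -Confirm:$false
$alarms | Set-AlarmDefinition -Enabled:$true -Confirm:$false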
12/21/2010

We ran into an issue where our custom alarms in vCenter weren’t generating alerts after the upgrade to vCenter 4.1. All of the alarms that had been defined in vCenter 4.0 were still in place after the upgrade, but alarming was inconsistent. We had one alarm defined on a single folder; some of the datastores that met the alert criteria were alarming and some of them weren’t.

I deleted the alarm definition and recreated it, and all of the datastores that should have been alarming lit up… I have a ticket open with VMware support. At this point I’m not sure if I’m going to have to manually rebuild all of our alarms.

vSphere workingDir configuration parameter erased by Storage vMotion

The workingDir parameter allows you to configure a specific datastore for VM snapshots. We attempted to configure the workingDir parameter on several of our virtual machines. We then found out the hard way that a Storage vMotion erases the workingDir parameter – a snapshot completely filled the datastore containing our C: drive and made a serious mess of the SQL VM.

VMware’s official position is that workingDir is not compatible with storage vMotion.
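For reference, setting the parameter looks something like the sketch below – a rough PowerCLI cut that writes workingDir as an extra config option through the vSphere API. The VM name and datastore path are placeholders, and per the above, a Storage vMotion will wipe the setting back out:

# Write workingDir into the VM's extra config; "SQL01" and the snapshot
# datastore path are placeholders, not our real names.
$vm  = Get-VM "SQL01"
$opt = New-Object VMware.Vim.OptionValue
$opt.Key   = "workingDir"
$opt.Value = "/vmfs/volumes/snapshot_datastore/SQL01"
$spec = New-Object VMware.Vim.VirtualMachineConfigSpec
$spec.ExtraConfig = @($opt)
($vm | Get-View).ReconfigVM($spec)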

VMware Automatic HBA rescan

When you add a new datastore, vCenter initiates an automatic rescan on the ESX hosts in the cluster. This can create a so-called “rescan storm,” with all of your hosts pounding away at the SAN at once, which can cause serious performance problems for the duration of the storm. Even if you don’t have enough hosts in a cluster to see a real problem, it’s pretty inconvenient to have to wait for each rescan when you’re trying to add 10 datastores.

To disable the automatic rescan, open your vSphere client.

1. Administration->vCenter Server Settings
2. Advanced Settings
3. Look for “config.vpxd.filter.hostRescanFilter”. If it’s not there, add it and set it to false.
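If you’d rather check or add the setting from PowerCLI, something like this works – treat it as a sketch, since it assumes the Get-AdvancedSetting / New-AdvancedSetting cmdlets that shipped in later PowerCLI releases:

# Look for the vCenter-level rescan filter; create it set to false if it's missing.
# Assumes an active connection ($global:DefaultVIServer is the vCenter session).
$name    = "config.vpxd.filter.hostRescanFilter"
$setting = Get-AdvancedSetting -Entity $global:DefaultVIServer -Name $name
if ($setting -eq $null)
{
    New-AdvancedSetting -Entity $global:DefaultVIServer -Name $name -Value "false" -Confirm:$false
}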

If it’s currently set to true, you have to edit vpxd.cfg directly. Open C:\Documents and Settings\All Users\Application Data\VMware\VMware VirtualCenter\vpxd.cfg, set the hostRescanFilter key to false, and restart the vCenter service:

<vpxd>
  <filter>
    <hostRescanFilter>false</hostRescanFilter>
  </filter>
</vpxd>

Doing this creates a new problem – you now have to manually force a rescan on every host.

Here is a PowerCLI one-liner to automate this process. It performs the rescan on each host in the cluster, one host at a time.

Get-Cluster "Production" | Get-VMHost | %{ Get-VMHostStorage $_ -RescanVmfs }

Unable to delete vSphere datastore

I had a nearly empty datastore in one of my vSphere clusters that I wanted to destroy so I could steal some of the LUNs to grow a different datastore. I migrated all machines and files off the datastore and it was completely empty, but I kept getting an error from vCenter saying it was unable to delete the datastore because the resource was in use. I eventually discovered that a VM in the cluster had its CD-ROM connected to one of the ISOs that I had moved off the datastore. I disconnected the CD-ROM and I was good to go.
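If you hit the same error, a quick PowerCLI check for the culprit looks something like this – a minimal sketch, with "OldDatastore" standing in for the real datastore name:

# Find any VM whose CD-ROM is backed by an ISO on the datastore you're trying to delete.
Get-VM | Get-CDDrive |
    Where-Object { $_.IsoPath -like "*OldDatastore*" } |
    Select-Object Parent, IsoPath

# Then detach the media from the offending drive(s):
# Get-VM "OffendingVM" | Get-CDDrive | Set-CDDrive -NoMedia -Confirm:$false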