Using snapshots to back up SQL under heavy IOPS loads

I find this problem coming up frequently in the field – you’ve virtualized your SQL server and you’re trying to back it up with your snapshot-based software.  In many cases, the SQL server is under such a heavy load that it’s impossible to commit the snapshot after the backup is taken. There’s just too much IO demand. You end up having to take a SQL outage to stop IO long enough to get the snapshot committed.

Here’s one strategy for setting up your virtual SQL servers to avoid this problem altogether. It uses a disk mode called independent persistent. An independent persistent disk is excluded from snapshots – all data written to it is committed immediately, even if a snapshot is active on the VM. If you place the SQL datafile and logfile drives in independent persistent mode, they are never included in a snapshot, so there is no heavy-IO snapshot to commit after the backup.

Here’s a disk layout that I’ve used for SQL servers. These drives are set to standard mode, so a snapshot picks them up.

C:\ – 40GB, SCSI 0:0
D:\ – SQL Binaries, 30GB, SCSI 0:1
L:\ – Logs, 1GB, SCSI 1:0
O:\ – Datafiles, 1GB, SCSI 2:0
Y:\ – Backups, 1GB, SCSI 3:0
Y:\SQLBAK01 – SCSI 3:1, 2TB+, mounted filesystem under Y:\

Your backup drive is limited to 2TB minus 512 bytes if you’re using vSphere 5.1 or earlier, but can go up to 62TB in later versions of vSphere.

These drives are set to independent persistent mode, so snapshots skip them.

L:\Logs01 – SCSI 1:1, independent persistent, variable size, mounted filesystem under L:\
O:\SQLData01 – SCSI 2:1, independent persistent, variable size, mounted filesystem under O:\
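
For reference, the disk mode of each virtual disk is recorded in the VM's .vmx file as a per-device mode entry. Here is a minimal Python sketch, not part of the original setup, that reads .vmx text and lists the SCSI devices a snapshot would skip. The sample text, the file names in it, and the helper name disks_excluded_from_snapshots are made up for illustration, and the sketch assumes the usual scsiX:Y.mode = "independent-persistent" form.

```python
import re

# Matches lines such as: scsi1:1.mode = "independent-persistent"
MODE_LINE = re.compile(r'^(scsi\d+:\d+)\.mode\s*=\s*"([^"]*)"\s*$', re.IGNORECASE)


def disks_excluded_from_snapshots(vmx_text):
    """Return the SCSI device IDs whose mode starts with "independent"."""
    excluded = []
    for line in vmx_text.splitlines():
        match = MODE_LINE.match(line.strip())
        if match and match.group(2).lower().startswith("independent"):
            excluded.append(match.group(1).lower())
    return sorted(excluded)


# Hypothetical .vmx excerpt loosely mirroring the layout above
SAMPLE_VMX = '''
scsi0:0.fileName = "sql01.vmdk"
scsi1:0.fileName = "sql01_1.vmdk"
scsi1:1.fileName = "sql01_2.vmdk"
scsi1:1.mode = "independent-persistent"
scsi2:1.fileName = "sql01_4.vmdk"
scsi2:1.mode = "independent-persistent"
scsi3:0.fileName = "sql01_5.vmdk"
'''

print(disks_excluded_from_snapshots(SAMPLE_VMX))
```

Devices without a mode line are treated as standard disks here, which is why only the log and datafile mountpoints show up in the result.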

Part of why we used mountpoints was for consistency – no matter what, L: was always logs, O: was always SQL data, and Y: was always backups. There were no questions as to whether a particular SQL server had a certain number of drives for a specific purpose – the entire structure was under a single, consistent drive letter.

Depending on the workload, O:\SQLData01 might have only 1 heavily used database on a single LUN, or it might have a bunch of small databases.  When we needed another one, we’d attach another mountpoint O:\SQLData02 on SCSI 2:2, L:\Logs02 on SCSI 1:2, Y:\SQLBAK02 on SCSI 3:2. Nightly backup jobs wrote SQL backups out to Y:\.  Since the Y drives are all in standard mode, backup jobs picked up the dumps in the normal snapshotting process.

If you had a complete loss of the entire SQL VM, you could restore from backup and you’d still have the L:, O:, and Y: drives with their mountpoints (although they might not have any disk attached to them), and you’d have to restore the databases from the SQL dumps on Y:\. Depending on the nature of the VM loss, you might have to spend some time manually fixing the mounts.

It took a little bit of work to maintain, but our backups worked every time. Part of setting up a new database was that the DBAs wrote a restore script and stored it in the root of Y:, which got backed up as part of the snapshot. Once the VM came back from a Veeam restore, the DBAs would bring up SQL, run the restore scripts, and we were off and running.

You also need to coordinate your DBAs’ backup schedule carefully with your backup software schedule. What you don’t want is backups being written to the Y: drive while an active snapshot is in place – you could easily fill up the datastore if your backups are large enough. Some backup software allows you to run pre-job scripts, and it’s a fairly simple task to add some code there to check whether a SQL backup is running. If so, postpone your backup snapshot and try again later.
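
The postpone-and-retry idea could be sketched roughly like this in Python. This is an illustration under assumptions, not the script we ran: how you detect a running SQL backup depends on your environment, so that check is left as a caller-supplied function, and the names wait_for_backup_window and is_sql_backup_running are hypothetical.

```python
import time


def wait_for_backup_window(is_sql_backup_running, attempts=6,
                           delay_seconds=600, sleep=time.sleep):
    """Return True once no SQL backup is running, or False after all attempts.

    is_sql_backup_running is a caller-supplied function (hypothetical here)
    that returns True while a SQL backup is still writing to Y:.
    """
    for attempt in range(attempts):
        if not is_sql_backup_running():
            return True  # safe to take the snapshot
        if attempt < attempts - 1:
            sleep(delay_seconds)  # postpone and try again later
    return False  # still busy; skip this backup run


# Example: the SQL backup finishes after two checks
states = iter([True, True, False])
ready = wait_for_backup_window(lambda: next(states), attempts=5, delay_seconds=0)
print("take snapshot" if ready else "skip this run")
```

A pre-job script built on this would exit with a failure code when the function returns False, assuming your backup software skips or reschedules the job in that case.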

The parent virtual disk has been modified since the child was created

Some VMs in my environment had virtual-mode RDMs on them, along with multiple nested snapshots. Some of the RDMs were subsequently extended at the storage array level, but the storage team didn’t realize there was an active snapshot on the virtual-mode RDMs. This resulted in immediate shutdown of the VMs and a vSphere client error “The parent virtual disk has been modified since the child was created” when attempting to power them back on.

I had done a little bit of work dealing with broken snapshot chains before, but the change in RDM size was outside of my wheelhouse, so we logged a call with VMware support. I learned some very handy debugging techniques from them and thought I’d share that information here. I went back into our test environment and recreated the situation that caused the problem.

In this example screenshot, we have a VM with no snapshot in place and we run vmkfstools -q -v10 against the vmdk file.
-q means query, and -v10 sets verbosity level 10.

The command opens up the disk, checks for errors, and reports back to you.



In the second example, I’ve taken a snapshot of the VM. I’m now passing the snapshot VMDK into the vmkfstools command. You can see the command opening up the snapshot file, then opening up the base disk.




In the third example, I pass it the snapshot vmdk for a virtual-mode RDM on the same VM – it traverses the snapshot chain and also correctly reports that the VMDK is a non-passthrough raw device mapping, which means a virtual-mode RDM.



Part of the problem was that the size of the RDM had changed (it was increased), but the snapshot still pointed to the old, smaller size. However, even without any changes to the storage, a corrupted snapshot chain can happen during an out-of-space situation.

I have intentionally introduced a drive geometry mismatch in my test VM below – note that the value after RW in the snapshot TEST-RDM_1-00003.vmdk is 1 less than the value in the base disk TEST-RDM_1.vmdk.



Now if I run it through the vmkfstools command, it reports the error that we were seeing in the vSphere client in Production when trying to boot the VMs – “The parent virtual disk has been modified since the child was created”. But the debugging mode gives you an additional clue that the vSphere client does not give – it says that the capacity of each link is different, and it even gives you the values (23068672 != 23068671).

The fix was to follow the entire chain of snapshots and ensure everything was consistent. Start with the most recent snapshot in the chain. Its “parentCID” value must be equal to the “CID” value in its parent, the next file down the chain, which is named in “parentFileNameHint”. So TEST-RDM_1-00003.vmdk is looking for a parentCID value of 72861eac, and it expects to see that in the file TEST-RDM_1.vmdk.

If you open up TEST-RDM_1.vmdk, you see a CID value of 72861eac – this is correct. You also see an RW value of 23068672. Since this file is the base RDM, this is the correct value. The value in the snapshot is incorrect, so you have to go back and change it to match. Every link in the snapshot chain must match its parent in the same way.
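
The same comparison can be sketched in Python as a rough aid when a chain has many links. This is an illustration only: it assumes each descriptor has one CID= line, one parentCID= line, and a single RW extent line. The descriptor text below is a trimmed, hypothetical reconstruction, and the snapshot's own CID (5a3b9c1d) and the extent file names are made up.

```python
import re


def parse_descriptor(text):
    """Pull CID, parentCID, parentFileNameHint and the RW size from descriptor text."""
    info = {"cid": None, "parent_cid": None, "parent_hint": None, "rw_sectors": None}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("CID="):
            info["cid"] = line.split("=", 1)[1].strip().lower()
        elif line.startswith("parentCID="):
            info["parent_cid"] = line.split("=", 1)[1].strip().lower()
        elif line.startswith("parentFileNameHint="):
            info["parent_hint"] = line.split("=", 1)[1].strip().strip('"')
        elif re.match(r"^RW\s+\d+\s", line):
            info["rw_sectors"] = int(line.split()[1])
    return info


def check_link(child, parent):
    """Return a list of mismatches between a snapshot descriptor and its parent."""
    problems = []
    if child["parent_cid"] != parent["cid"]:
        problems.append("parentCID %s != parent CID %s" % (child["parent_cid"], parent["cid"]))
    if child["rw_sectors"] != parent["rw_sectors"]:
        problems.append("capacity %s != %s" % (child["rw_sectors"], parent["rw_sectors"]))
    return problems


# Hypothetical, trimmed descriptor contents
BASE = '''
CID=72861eac
parentCID=ffffffff
RW 23068672 VMFSRDM "TEST-RDM_1-rdm.vmdk"
'''
SNAPSHOT = '''
CID=5a3b9c1d
parentCID=72861eac
parentFileNameHint="TEST-RDM_1.vmdk"
RW 23068671 VMFSSPARSE "TEST-RDM_1-00003-delta.vmdk"
'''

base = parse_descriptor(BASE)
snapshot = parse_descriptor(SNAPSHOT)
print(snapshot["parent_hint"], check_link(snapshot, base))
```

For a longer chain you would repeat check_link for each snapshot and the file named in its parentFileNameHint, working down to the base disk. Take copies of the descriptor files before editing any of them by hand.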



I change the RW value in the snapshot back to match 23068672 – my vmkfstools command succeeds, and I’m also able to delete the snapshot from the vSphere client.


Is It Time To Remove the VCP Class Requirement – Rebuttal

This post is a rebuttal of @networkingnerd‘s blog post Is It Time To Remove the VCP Class Requirement.

I would like to acknowledge that it’s easy for me to have the perspective I do as a VCP holder since version 3. I’ve already got it, so I naturally want it to remain valuable. The fact that my employer at the time paid for the class has opened up an entire career path for me that would have otherwise been closed. But I believe the VCP cert remains fairly elite specifically because of the course requirement.

First, consider Microsoft’s certifications. As a 15-year veteran of the IT industry, I believe I am qualified to state unequivocally that Microsoft certifications are utterly worthless. This is partially because of the proliferation of braindumps. There is no knowledge requirement whatsoever to sit the Microsoft exams. You don’t even need to look at a Microsoft product to pass a Microsoft test – go memorize a braindump and pass the test. We’ve all encountered paper MCSEs – their existence completely devalues the certification. I consider the MCSE nothing more than a little checkbox on some recruiter’s wish list.

I would go so far as to say that Microsoft’s tests are specifically geared towards memorizers – they actually encourage braindumping by focusing on irrelevant details and not on core skills. Passing a Microsoft exam has everything to do with memorization and almost nothing to do with your skill as a Windows admin.

On the other hand, to sit the VCP exam you have to go through a week of training. At the very least, you’ve touched the software. You installed it. You configured it. You (wait for it)… managed it.  Obviously there are braindumps out there for the VCP exam too, but everybody starts with the same core of knowledge. The VCP exams have improved to a point where they are not memorize-and-regurgitate. A person who has worked with the product actually stands a reasonable chance of passing the exam.

Quoted directly from the blog post:

For those that say that not taking the class devalues the cert, ask yourself one question. Why does VMware only require the class for new VCPs? Why are VCPs in good standing allowed to take the test with no class requirement and get certified on a new version? If all the value is in the class, then all VCPs should be required to take a What’s New class before they can get upgraded. If the value is truly in the class, no one should be exempt from taking it. For most VCPs, this is not a pleasant thought. Many that I talked to said, “But I’ve already paid to go to the class. Why should I pay again?” This just speaks to my point that the value isn’t in the class, it’s in the knowledge. Besides VMware Education, who cares where people acquire the knowledge and experience? Isn’t a home lab just as good as the ones that VMware built.

There is a tiny window of opportunity after the release of a new vSphere edition to go take the exam without a course requirement. Those of us who are able to pass the exam in that small window are the people who do exactly as you say – we are downloading and installing the software in our labs. We are putting in the time to make sure that our knowledge of the newest features is up to par. Many of us participate in alpha and beta programs, spending far more time with the software than a typical certification candidate. Some of us participate in the certification beta program, where we have just a couple of short weeks to study for and book the exam. I’ve put in quite a few late nights prepping for beta exams.

VMware forces us to learn the new features by putting a time limit on the upgrade period. We have a foundation of knowledge that was created by the original class that we took. There isn’t enough time for braindumps to leak out there, and the vast majority of upgraders wouldn’t use one anyhow. VMware can be reasonably certain that a VCP upgrader without the class really is taking the time to learn the new features. @networkingnerd is correct, the value IS in the knowledge, but the focus is ensuring that every VCP candidate starts with the same core of knowledge.

@networkingnerd suggests an alternative lower level certification such as a VCA with a much less expensive course requirement. I think it’s an interesting idea, but I don’t know how you’d put it into practice. I’m not sure what a 1-day class could prepare you for. It’s one thing for experienced vSphere admins to attend a 2-day What’s New class. But what could you really teach and test on? Just installing vSphere? There’s not a whole lot of value for an engineer who can install but not configure.

Again quoting from the article:

Employers don’t see the return on investment for a $3,000US class, especially if the person that they are going to send already has the knowledge shared in the class. That barrier to entry is causing VMware to lose out on the visibility that having a lot of VCPs can bring.

This suggests that the entry-level certification from the leader in virtualization is somehow not well-known. While I would agree that the VCAP-level certifications do not enjoy the same level of name recognition as the CCNP, I talk to seniors in college who know what the VCP is. There is no lack of awareness of the VCP certification. I also agree that it’s ridiculous to send a VMware admin who has years of experience to the Install Configure Manage class. That’s why the Optimize and Scale and the Fast Track classes exist.

I don’t believe dropping the course requirement would do anything to enhance VMware’s market share. The number of VCP individuals has long since reached a critical mass.  Nobody is going to avoid buying vSphere because of a lack of VCPs qualified to administer the environment. While I agree that Hyper-V poses a credible threat, Microsoft is just now shipping features that vSphere has had for years. Hyper-V will start to capture the SMB market, but it will be a long time before it has the chance to unseat vSphere in the enterprise.

Free vSphere 5 ICM class giveaway

Just received this from Chicago VMUG leader Chris Wahl:
Want to get VMware certified in vSphere 5? The Chicago VMUG can help!

Have you been looking at the VMware Certified Professional (VCP) in vSphere 5 and wondering “how the heck do I save up $3000 to attend the required course”? It’s crossed a lot of minds and has been a topic I’ve heard often from colleagues and members of this group. In this day of tightened budgets and spend freezes, it can be nearly impossible to get your employer to justify sending you to Install, Configure, Manage on vSphere 5 in order to get your VCP5.

The Chicago VMUG is looking to help!

Thanks to a really solid sponsor showing at our upcoming VMUG Conference (October 31st!) we are putting aside funds to make sure one grand prize winner gets their Install, Configure, Manage on vSphere 5 class paid for, in full, on the date and location of their choosing. We only ask that you be working towards getting your VCP5 and not be eligible for the exam already (this excludes VCP4s who can take the exam without a course requirement until February 2012 and those who have already taken a qualifying vSphere 5 course).

What do you have to do to enter? Simple, register for the Chicago VMUG conference (link is attached to this post) and then stop by the VMUG booth at the day of the conference. We’ll scan your badge and pick one lucky winner to get trained! The VMUG staff will contact you to set up the details after the conference, so that you can pick the right time and place.

Already a VCP4 or VCP5? Please spread the news to a friend, co-worker, or your twitter followers! For more breaking news I also invite you all to follow the @ChicagoVMUG twitter account.

Best of luck to those who enter!

Event Registration

vSphere Datastore Last Updated timestamp – Take 2

I referenced this in an earlier post, but we continue to have datastore alarm problems on hosts running 4.0 U2 connected to a 4.1 vCenter. In some cases, the timestamp on the datastore does not update, so it’s not just the alarm having a problem but also the datastore timestamp. As a stopgap measure, we scheduled a little PowerCLI script to run automatically and ensure all of the datastores are refreshed. We then accelerated our upgrade plans to quickly eliminate the problem from our primary datacenter. We now only have it in our DR site, so it’s not critical anymore, just annoying.

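# Load the PowerCLI snap-in if it is not already present
if ( (Get-PSSnapin -Name VMware.VimAutomation.Core -ErrorAction SilentlyContinue) -eq $null ) {
    Add-PsSnapin VMware.VimAutomation.Core
}
$ds = Get-Datastore
foreach ( $dst in $ds ) {
    $dsv = $dst | Get-View
    Write-Host "Refreshing" $dsv.Name
    # The refresh call is missing from this copy of the script; RefreshDatastoreStorageInfo() is assumed
    $dsv.RefreshDatastoreStorageInfo()
}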

vCenter 4.1 upgrade problem

I was upgrading vCenter from 4.0 U2 to 4.1 and installing it on a clean Windows 2008 64-bit server. The vCenter upgrade went OK, but the Update Manager install failed with “Error 25085. Setup failed to register VMware vCenter Update Manager extension to VMware vCenter Server.” I found VMware KB1024795 with a few fixes, but they did not resolve the issue.

I was trying to install Update Manager on the D: drive. I opened a ticket with VMware support and after some troubleshooting, their advice was to rebuild the 2008 server. Before starting over, I did a little more poking around. I discovered that somehow the local admins group had been removed from the D: drive permissions.

I was logged on to the domain with administrative permissions on the server and vCenter installed just fine. I’m not sure why Update Manager threw an error, but granting the local administrators group full control of D: resolved the problem.

Guest NICs disconnected after upgrade

We are upgrading our infrastructure to ESXi 4.1 and had an unexpected result in the Development cluster, where multiple VMs were suddenly disconnected after vMotion. It sounded a lot like a problem I had seen before, where a misconfigured port count on a vSwitch leaves too few available ports and the VMs that vMotion over are unable to connect to the switch. You generally avoid this with host profiles, but it’s possible a host in the Dev cluster fell out of sync. In any event, the server that was being upgraded this evening had been rebuilt and it wasn’t worth trying to figure out what the configuration might have been. I needed to go through, find all VMs that should have been connected but weren’t, and reconnect them. I decided that I needed:

  • VMs that were currently Powered On – obviously as Powered Off VMs are all disconnected
  • VMs with NICs currently set to “Connect at Power On” so I could avoid connecting something that an admin had intentionally left disconnected
  • VMs with NICs currently not connected

Note that this script will change network settings and REBOOT VMs if you execute it. For each disconnected NIC, it pings the guest DNS name first to ensure the IP isn’t already on the network, then connects the NIC, then pings again to make sure the guest is back on the network. I watched the script while it executed and figured I could Control-C to stop if something looked wrong. The script reboots each reconnected guest to avoid any failed service / failed task problems that might have occurred while the guests were disconnected.

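# Powered-on VMs in the Development cluster
$vms = Get-Cluster "Development" | Get-VM | Where { $_.PowerState -eq "PoweredOn" } | Sort-Object Name
foreach ($vm in $vms) {
    # NICs set to connect at power on that are currently disconnected
    $nics = $vm | Get-NetworkAdapter | Where { $_.ConnectionState.Connected -eq $false -and $_.ConnectionState.StartConnected -eq $true }
    if ($nics -ne $null) {
        foreach ($nic in $nics) {
            Write-Host $vm.Name
            Write-Host $nic
            # Ping first to confirm the address is not already on the network
            ping $vm.Guest.HostName -n 5
            $nic | Set-NetworkAdapter -Connected $true -Confirm:$false
            # Ping again to confirm the guest is back on the network
            ping $vm.Guest.HostName -n 5
        }
        # Reboot the guest to clear any failed services or tasks
        $vm | Restart-VMGuest
    }
}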

vSphere Datastore Last Updated timestamp

We encountered issues that mysteriously appeared after a patching window on our vCenter server. We had just upgraded to vCenter 4.1 earlier in the month and this was the first time the server had been rebooted since the upgrade.

After reboot, the Last Updated timestamp on all datastores was stuck at the time the vCenter service came online. None of our disk space alarms worked because the datastore statistics were not being updated.

We noticed that the problem only appeared on the 4.0 U2 hosts – datastores connected to 4.1 clusters had no problem.

VMware support acknowledged the timestamp update issue as a problem in 4.0 U2 that was partially addressed in 4.1 and fully addressed in the soon-to-be-released 4.1 U1.