motogobi.com

vSphere High Availability and the split-brain scenario

So we ran into this last night and are trying to work through why things happened the way they did – which is to say, why we had some support personnel on the phone who said “it shouldn’t do that.” The long and short of a split-brain scenario is this:

You have two hosts with VMs running in an HA cluster. Host 1 gets completely isolated from the network – no service console, no guest net connections, nothing – but the VMs are still running on it. If your HA settings are set to not power down those virtual machines on isolation, HA has already started up those VMs on Host 2 – and when Host 1 reconnects to the network (with its never-shut-down VMs) you’ve now got two VMX processes for the same VM running in memory on two hosts. This is not good.
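To make the sequence concrete, here is a toy model of it in Python. It is only a sketch of the scenario as described above, not of how HA is actually implemented: the `Host` class, the function names, and the VM names are all made up for illustration, and the restart on the surviving host is assumed to always succeed.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    connected: bool = True
    # Names of VMs that have a live VMX process on this host (hypothetical model).
    vmx: set = field(default_factory=set)


def isolate(host, survivor, power_off_on_isolation):
    """Cut `host` off the network; assume HA restarts its VMs on `survivor`."""
    host.connected = False
    to_restart = set(host.vmx)
    if power_off_on_isolation:
        # Isolation response "power off": the isolated host kills its own VMs.
        host.vmx.clear()
    # Otherwise the old VMX processes keep running while HA restarts the VMs.
    survivor.vmx |= to_restart


def reconnect(host):
    host.connected = True


def split_brain(hosts):
    """Return VMs that have a VMX process on more than one connected host."""
    seen, dupes = set(), set()
    for h in hosts:
        if h.connected:
            dupes |= seen & h.vmx
            seen |= h.vmx
    return dupes


def scenario(power_off_on_isolation):
    h1 = Host("host1", vmx={"vm-a", "vm-b"})
    h2 = Host("host2", vmx={"vm-c"})
    isolate(h1, h2, power_off_on_isolation)
    reconnect(h1)
    return split_brain([h1, h2])


print(sorted(scenario(power_off_on_isolation=False)))  # ['vm-a', 'vm-b']
print(sorted(scenario(power_off_on_isolation=True)))   # []
```

With the VMs left powered on, both of Host 1’s VMs come back duplicated after the reconnect; with power-off on isolation, nothing is duplicated.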

The current solution is to make sure that your HA settings, as shown in the attached picture, are such that a host will power off its virtual machines when it finds itself isolated, thereby making sure that only one VMX process is running per VM. But there is still a window of time – 12 seconds by default – where a reconnection of that isolated host can mean that you’ll end up with those two VMX processes running on separate hosts. In my experience so far, if you’re running HA you’re going to run into this situation sooner rather than later. Some notes on what we saw last night:

  • HA was set up correctly to shut down VMs on isolation. Yet no VMs were shut down on isolated hosts, even after several minutes of isolation.
  • We started manually taking note of the VMs running on each isolated host using esxtop, shutting the hosts down from the command line through iLO, and re-registering those VMs by hand on good hosts.
  • Before we could finish this, network connectivity was restored and we found ourselves in a split-brain situation. It was fairly easy to tell what was going on by looking at the listing for VMs in the cluster. In the column showing which host a VM was registered on, you’d see the value change back and forth every few seconds.
  • The best course of action, unfortunately, was to right-click the VM and power it off. In some cases we’d check the VM’s console and see a BSOD, power it off, and immediately the console would switch to a booted Windows login screen – that meant that one of the VMX processes was the BSOD (the “old” process) and that the new process was the one where HA had taken over and properly booted up the VM.
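The “host column keeps flipping” symptom is easy to check for mechanically if you have a way to sample which host each VM is registered on. The Python sketch below only does the detection part. How you collect the samples is left out, the function name and the `min_changes` threshold are my own assumptions, and the sample data is invented.

```python
from collections import defaultdict


def find_flapping_vms(observations, min_changes=2):
    """Flag VMs whose registered host keeps changing.

    `observations` is an iterable of (vm_name, host_name) pairs in time
    order. A single change could be an ordinary migration, so only VMs
    with at least `min_changes` host changes are reported.
    """
    last_host = {}
    changes = defaultdict(int)
    hosts_seen = defaultdict(set)
    for vm, host in observations:
        if vm in last_host and last_host[vm] != host:
            changes[vm] += 1
        last_host[vm] = host
        hosts_seen[vm].add(host)
    return {
        vm: sorted(hosts_seen[vm])
        for vm, count in changes.items()
        if count >= min_changes
    }


# Invented samples: web01 bounces between two hosts, db01 stays put.
samples = [
    ("web01", "esx1"), ("db01", "esx2"),
    ("web01", "esx2"), ("db01", "esx2"),
    ("web01", "esx1"), ("db01", "esx2"),
    ("web01", "esx2"),
]
print(find_flapping_vms(samples))  # {'web01': ['esx1', 'esx2']}
```

Anything this reports is a candidate for the manual power-off described above; it does not tell you which of the two processes is the stale one.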

So the largest concern on our part right now is that isolation response – why didn’t these hosts know to start shutting down their VMs? We’re running 4.0U1, though possibly 3-4 patches back. We’re working with VMware’s support personnel to determine the root cause on that. The good news, though, is that vSphere 4.0 Update 2 should include code to directly address this long-standing problem with High Availability. Currently, if you know enough to connect directly to each ESX host during this split-brain outbreak, you’ll be greeted with a dialog box to release the lock on a VM and resolve the split-brain situation for that VM. In Update 2, apparently, this will be auto-answered for you. After last night, I can’t wait…

zombies tag applies ’cause that’s what it feels like is going on when you’re in the thick of it ;)


Categorised as: Virtualization

