motogobi.com

Update on a “failed” HA, split-brain event

So some interesting stuff has fallen out of the investigation we’ve done regarding how VMware High Availability handled five of our hosts falling off the network last week. In speaking with VMware’s support staff I’ve learned a few things to keep in mind when planning architecture, as well as how to respond to something like this in the future (hint: don’t panic). Turns out, ESX didn’t really fail as much as it politely gave up, opting to take the route that seems to be the least harmful to our guest VMs’ operating systems. Admit it, we’ve all been there: you’re working on a Windows machine, it’s not responding, and you get to the point where you just hit the reset button. Well, VMware will let you – and only you – take that final step towards OS recovery during an event like this.

So our HA event was a little more in-depth than what High Availability is normally set up to handle. We had five hosts experience this in total but the same thing happened on each – so I’ll focus on only a single host for this post. When the host found itself unable to reach any of the other hosts in the cluster via the Service Console network connection, it correctly understood that it was isolated and started to gracefully shut down each of its running VMs. Basically this was a “stop trysoft” command sent to each VM. However, we also had problems with the network connections for the NFS datastores – so the ESX host had a running VM process in memory but no disk or lockfiles to work with on the back end of that VM.

According to VMware support, ESX will not go any further in trying to shut down your VM – it will not just kill the VM’s process running in memory. If ESX can’t perform a graceful shutdown of a VM, it stops trying to shut that VM process down and leaves it running.
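To make that behavior concrete, here is a toy model of the isolation response as support described it to us. It is only a sketch of my understanding, not VMware code: the `VM` class, the `datastore_reachable` flag and the `isolation_response` function are all names I made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VM:
    # Hypothetical stand-in for a guest on the isolated host.
    name: str
    datastore_reachable: bool  # can the host still see the VM's disk and lockfiles?
    running: bool = True


def isolation_response(vms: List[VM]) -> List[str]:
    """Try a graceful shutdown of each VM; never hard-kill the process.

    Returns the names of VMs whose process is left running in memory.
    """
    left_running = []
    for vm in vms:
        if vm.datastore_reachable:
            # Graceful shutdown can complete because the back end is there.
            vm.running = False
        else:
            # Graceful shutdown cannot complete; the host gives up
            # rather than killing the process.
            left_running.append(vm.name)
    return left_running


vms = [VM("web01", datastore_reachable=True), VM("db01", datastore_reachable=False)]
print(isolation_response(vms))
```

Any VM that ends up in that returned list is a split-brain candidate once the network comes back.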

That’s an important thing to understand and remember and directly speaks to what we saw: when some of our ESX hosts regained network connectivity, we had the classic split-brain scenario. We would open the console to a VM, see a BSOD, and power off the VM – and then almost immediately see the running operating system in that same console window. What we were watching was the BSOD process on the formerly-isolated ESX host being stopped, replaced by the healthy process of the VM when it was brought up on one of the surviving hosts.

So, some takeaways from this event?

  • Once we realized that we had lost both the Service Console connection and the NFS datastore connection, we simply should have taken note of what VMs were still running in memory by using esxtop and then powered off the ESX host until the network outage was resolved. The other surviving hosts were already booting up the VMs. The key here, obviously, is that this is a manual process – if this outage occurred and resolved itself before we could get to the isolated hosts we’d still end up with a split-brain problem.
  • We might want to reconsider our use of NFS and its dependence on network connectivity to the datastores. This major outage interrupted TCP/IP traffic at many different layers of the network, to the point where NIC redundancy and failover did not occur. It’s been my experience that FC SAN, while more expensive, has also been more robust and less prone to outages – the vast majority of our outages involving ESX have been Ethernet-related problems.
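The manual response from the first takeaway could be written down as a small runbook helper, so nobody has to reason it out mid-outage. This is a sketch of our own lesson learned, not VMware guidance: the `triage` function and its parameters are hypothetical, and it only returns a checklist for a human to follow.

```python
from typing import List


def triage(service_console_up: bool, nfs_up: bool, vms_in_memory: List[str]) -> List[str]:
    """Return a checklist for an ESX host during a network outage."""
    if service_console_up and nfs_up:
        return ["Host is not isolated; nothing to do."]
    if not service_console_up and not nfs_up:
        # The case from this post: isolated AND cut off from the datastores.
        names = ", ".join(vms_in_memory) if vms_in_memory else "none"
        return [
            "Note the VMs still running in memory (we'd use esxtop): " + names,
            "Power off the ESX host until the network outage is resolved.",
            "Let the surviving hosts bring the VMs up.",
        ]
    if not service_console_up:
        # Isolated, but the disks are reachable, so graceful shutdown can finish.
        return ["Datastores reachable: the isolation response should shut the VMs down gracefully."]
    # Service Console up but NFS down is not a case this post covers.
    return ["Not covered by this post; investigate the datastore connection."]


for step in triage(False, False, ["web01", "db01"]):
    print("-", step)
```

The catch from the first bullet still applies: this is a manual process, so it only helps if someone gets to the isolated hosts before the outage resolves itself.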

Categorised as: Virtualization

