
Prevention of 'split brain' (dual master) in pools

Started by Avi Bluestein , 12 October 2017 - 07:59 AM
5 replies to this topic

Avi Bluestein (Members)
  • 5 posts

Posted 12 October 2017 - 07:59 AM

Issue:

A split brain (dual master) in a pool has serious implications and corrupts critical customer data.

It can occur quite easily, and it could be prevented just as easily with a small piece of code.

 

Background:

We recently had an incident of a 'split brain' in a pool where we restarted the master ('M1') and it hung on shutdown for an unknown reason. Some of the VMs migrated, whilst others shut down without migrating.

We had no choice but to designate a new master ('M2') because the original master (M1) was hung. The new master (M2) somehow didn't sync properly and saw only half of the pool. When the original master (M1) finally rebooted, it came up as a master too and started machines that were already running under the other master (M2) - corrupting the VDIs and causing havoc.

There was no communication issue or any other interference: pings did go through between the hosts, and they were perfectly fine both prior to the restart and after the issue was resolved. We recovered by using recover-slaves and eventually things came back into order.

 

Suggestion:

This horrible lack of synchronization has grave implications and could be prevented by a simple piece of code that performs a 'sanity test' between hosts. This code should detect a split brain by communicating with the other hosts it knows about over the simplest possible TCP socket, to determine whether there is another master. It should NOT go through the XenAPI because, as we saw, that didn't work properly for whatever reason (I guess because it all relies on proper pool synchronization to work).

A split brain must not occur under any circumstances, and this additional piece of code is a safety measure I would personally consider even though it might not seem proper to circumvent the XenAPI. It would prevent disasters to VM data, which is of the utmost importance.
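
To make the idea concrete, here is a minimal sketch of the kind of check I have in mind, assuming each host runs a tiny agent on a spare TCP port. The peer list, the port number and the pool.conf-based master test are illustrative placeholders to adapt, not anything official:

```python
# Hypothetical sketch of the 'sanity test' described above: a tiny agent on each
# host answers whether it believes it is the pool master, and a checker polls
# every peer over a plain TCP socket (deliberately NOT via XenAPI).
import socket

PEERS = ["server1", "server2", "server3"]   # all pool members (placeholder list)
PORT = 15999                                # arbitrary unused TCP port (assumption)

def is_local_host_master():
    """Best-effort local check: does this host believe it is the master?"""
    # /etc/xensource/pool.conf starts with "master" on the pool master
    # (an assumption based on common XenServer layouts; verify on your version).
    try:
        with open("/etc/xensource/pool.conf") as f:
            return f.read().strip().startswith("master")
    except OSError:
        return False

def serve_role_forever():
    """Run on every host: answer 'master' or 'slave' to any TCP connection."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", PORT))
    srv.listen(5)
    while True:
        conn, _addr = srv.accept()
        conn.sendall(b"master" if is_local_host_master() else b"slave")
        conn.close()

def masters_claiming_role():
    """Run anywhere: ask every peer for its role and collect master claims."""
    masters = []
    for host in PEERS:
        try:
            with socket.create_connection((host, PORT), timeout=3) as s:
                if s.recv(32).strip() == b"master":
                    masters.append(host)
        except OSError:
            pass  # unreachable host: cannot rule a split brain in or out
    return masters

if __name__ == "__main__":
    claimants = masters_claiming_role()
    if len(claimants) > 1:
        print("SPLIT BRAIN: multiple masters detected:", claimants)
    else:
        print("OK, masters claiming the role:", claimants)
```

A watchdog like this could simply raise an alert (or refuse to auto-start VMs) the moment more than one host claims to be master.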



Tobias Kreidl (CTP Member)
  • 18,831 posts

Posted 12 October 2017 - 04:27 PM

Avi,

 

Citrix states that HA is officially supported only on pools with at least three hosts, precisely because of this split-brain situation.

 

For a reliable way to run a two-server pool, I suggest you look into HA-Lizard (https://www.halizard.com/), and I'd also suggest taking a look at the article https://xenserver.org/blog/entry/xenserver-high-availability-alternative-ha-lizard-1.html to see some of the recent improvements that have been implemented.

 

-=Tobias



Avi Bluestein (Members)
  • 5 posts

Posted 12 October 2017 - 06:27 PM

Oh excuse me if I didn't state it in the background - 

This specific pool has 14 servers, not just two.

 

This happened when server8 was the master; it was restarted and hung, so server10 was designated master.

When server8 completed its restart (after a forceful reset, since it was stuck at the end of the shutdown sequence), it too came up as a master and all hell broke loose: machines were in an incorrect state (running but appearing off), and it started them again while they were already running on a different host.

 

Another issue we had, related but off topic, is the failure to migrate all VMs automatically when a host is restarted (about 30-40% migrate and then the rest just die without being migrated as the shutdown proceeds). This happened every time we attempted a restart without migrating manually.



Tobias Kreidl (CTP Member)
  • 18,831 posts

Posted 12 October 2017 - 07:55 PM

Ah... that's very different! :)  I would then check whether NTP is correctly set up and in sync (check each host with "ntpstat -s"; the offset should preferably be less than a few hundred milliseconds). If the pool is very busy, there may also be load issues, hence I'd check dom0 to make sure it has adequate resources (VCPUs and memory). When you run a manual migration, run top and/or xentop and monitor the resources to see whether you are hitting saturation of some sort. The swap space should always be close to zero, for example, or your dom0 doesn't have enough memory allocated.
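
As a rough illustration of those two checks, something along these lines could be run inside dom0 on each host; the thresholds and the parsing of the ntpstat output are assumptions to adapt, not official values:

```python
# A rough dom0 helper for the two checks above: NTP sync quality (parsed from
# ntpstat output) and swap usage (from /proc/meminfo).
import re
import subprocess

MAX_OFFSET_MS = 300            # "a few hundred milliseconds" as a rough ceiling
MAX_SWAP_USED_KB = 50 * 1024   # flag anything beyond ~50 MB of swap in use

def ntp_offset_ms():
    """Return the 'time correct to within N ms' value from ntpstat, or None."""
    try:
        proc = subprocess.Popen(["ntpstat"], stdout=subprocess.PIPE,
                                universal_newlines=True)
        out = proc.communicate()[0]
    except OSError:
        return None                      # ntpstat not installed
    match = re.search(r"time correct to within (\d+) ms", out)
    return int(match.group(1)) if match else None

def swap_used_kb():
    """Return the amount of swap in use (kB) according to /proc/meminfo."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.strip().split()[0])
    return info["SwapTotal"] - info["SwapFree"]

if __name__ == "__main__":
    offset = ntp_offset_ms()
    if offset is None:
        print("WARN: could not read an offset from ntpstat")
    elif offset > MAX_OFFSET_MS:
        print("WARN: NTP offset ~{0} ms exceeds {1} ms".format(offset, MAX_OFFSET_MS))
    else:
        print("OK: NTP offset ~{0} ms".format(offset))

    swap = swap_used_kb()
    status = "WARN" if swap > MAX_SWAP_USED_KB else "OK"
    print("{0}: {1} kB of swap in use in dom0".format(status, swap))
```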

 

What's your primary management interface configuration (physical, LAN, 1/10 GB, on a bond or not, etc.)?

 

-=Tobias



Avi Bluestein (Members)
  • 5 posts

Posted Yesterday, 07:59 AM

I wish it were as easy as that - to blame it on some 3rd-party factor.

All servers are in sync (although I doubt a few hundred msec would matter that much, and it shouldn't work that way, because 1-2 second slips are common in systems), the NICs are fine, the BIOS is up to date, and all the usual things people ask about are not the issue here.

The same goes for load - this is a relatively new and quite 'empty' cluster with a lot of CPU and RAM to spare on each host.

To be frank, we run into a LOT of problems with XenServer. All of them seem to be somehow related to networking and synchronization between hosts. We also have massive problems with the DVSC, which I'm opening in another thread.

 

The issue above turned into something even worse.

After all slaves had been accounted for, it turned out we had invisible 'ghost' machines that were running on one of the hosts and were not visible anywhere in XenCenter, nor in any CLI command that enumerates the running VMs. Furthermore, the memory they consumed wasn't even accounted for in XenCenter when inspecting the RAM layout, and the 'free' RAM listed there was actually incorrect, given the 16GB consumed by these ghosts.

The only way to kill them was to reboot the host, which also hung on shutdown while unmounting the SRs, and we had to force a reboot.
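
In case it helps anyone chasing something similar, one rough way to surface such ghosts is to compare what the hypervisor itself reports with what xapi believes is resident on the host. A minimal sketch of that comparison, run in dom0, might look like this; the field names and filters are assumptions to verify against your XenServer version:

```python
# A rough diagnostic, run in dom0, that compares the domains Xen itself reports
# ("xl list") with the VMs xapi believes are resident on this host
# ("xe vm-list"). A domain ID present in the first set but missing from the
# second would be a candidate ghost.
import socket
import subprocess

def run(cmd):
    """Run a command and return its trimmed stdout."""
    return subprocess.check_output(cmd, universal_newlines=True).strip()

def xl_domain_ids():
    """Domain IDs known to the hypervisor (skip the header line and dom0)."""
    ids = set()
    for line in run(["xl", "list"]).splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 2 and fields[1].isdigit() and fields[1] != "0":
            ids.add(fields[1])
    return ids

def xapi_domain_ids():
    """Domain IDs of the VMs xapi thinks are resident on this host."""
    # The 'hostname' filter assumes xapi stores the same name this host
    # reports; adjust if your hosts are registered with FQDNs.
    host_uuid = run(["xe", "host-list",
                     "hostname=" + socket.gethostname(), "--minimal"])
    out = run(["xe", "vm-list", "resident-on=" + host_uuid,
               "is-control-domain=false", "params=dom-id", "--minimal"])
    return {d for d in out.split(",") if d and d != "-1"}

if __name__ == "__main__":
    ghosts = xl_domain_ids() - xapi_domain_ids()
    if ghosts:
        print("Possible ghost domains (in xl but not in xapi): " + ", ".join(sorted(ghosts)))
    else:
        print("No mismatch between xl and xapi on this host.")
```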

 

Too many weird, unstable things, and this is a new cluster, not some 'upgraded' old one. Clean install of 7.1 with ALL patches.

In another, similar cluster we also had a lot of problems, and we even had a weird thing that the guys from XOA noticed - a VM running off a SNAPSHOT and not off a VDI. This is why XOA could not snapshot it and got a 'non-existent VDI' error when attempting to do so. They checked it out and it turned out the VM was running off a snapshot - which is 'impossible' as per their claim.

 

XenServer is by far the most unstable hypervisor I've encountered. It is such a great alternative to VMware, but by god we don't get one quiet week without problems in our clusters (running off IBM / HP blades).



Tobias Kreidl (CTP Member)
  • 18,831 posts

Posted Yesterday, 08:43 PM

Anything more than around 200 - 300 msec can be problematic. A "few seconds" can be enough of an offset to cause issues. The pool is very sensitive to being properly synchronized. I'd check each host with "ntpstat -s" to see what the offsets are and make sure NTP is staying properly synced.

 

A VM can run off a snapshot if it's a fast clone rather than a full copy and is sharing the same disks; I proved that to myself again just last week.
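
If it helps to see the distinction, here is a minimal sketch of both operations via the standard xe CLI, wrapped in Python only for convenience; the source VM name is a placeholder, not anything from this thread:

```python
# A minimal illustration of fast clone vs. full copy with the xe CLI from dom0.
import subprocess

def xe(*args):
    """Run an xe command and return its trimmed output (the new VM's UUID)."""
    return subprocess.check_output(["xe"] + list(args),
                                   universal_newlines=True).strip()

SOURCE_VM = "some-halted-vm-or-template"   # placeholder name

# Fast clone: copy-on-write, the new VM's disks stay chained to the source's
# VDI/snapshot tree rather than getting independent copies.
fast_uuid = xe("vm-clone", "vm=" + SOURCE_VM, "new-name-label=fast-clone-demo")

# Full copy: independent VDIs are created, nothing is shared with the source.
full_uuid = xe("vm-copy", "vm=" + SOURCE_VM, "new-name-label=full-copy-demo")

print("fast clone:", fast_uuid)
print("full copy :", full_uuid)
```

The fast clone staying chained to the source's disk tree is what can make such a VM appear to be "running off a snapshot", whereas the full copy ends up with disks of its own.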

 

-=Tobias