A split brain (dual master) in a pool has serious implications: it can corrupt critical customer data.
It can occur surprisingly easily, yet it could be prevented with a small piece of code.
We recently had a 'split brain' incident in a pool: we restarted the master ('M1') and it hung on shutdown for an unknown reason. Some of the VMs migrated, while others shut down without migrating.
We had no choice but to designate a new master ('M2') because the original master (M1) was hung. The new master (M2) somehow failed to sync properly and saw only half of the pool. When the original master (M1) finally rebooted, it came up as a master too and started VMs that were already running under the other master (M2), corrupting their VDIs and causing havoc.
There was no communication issue or other interference: pings went through between the hosts, and they were perfectly fine both before the restart and after the issue was resolved. We recovered using recover-slaves, and things eventually came back into order.
This lack of synchronization has grave implications, and it could be prevented by a simple piece of code that performs a 'sanity test' between hosts. The code would detect a split brain by asking the other hosts it knows about, over the simplest possible TCP socket, whether another master exists. This should NOT go through the XenAPI because, as we saw, it did not work properly in this situation (presumably because it relies on proper pool synchronization to work in the first place).
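To make the idea concrete, here is a minimal sketch of such a sanity test. All names, the port number, and the one-word wire protocol are assumptions for illustration; this is not part of XenAPI, which is exactly the point. Each host runs a tiny TCP responder that reports its current role, and each host that believes it is the master periodically polls its known peers and raises the alarm if any of them also claims mastership:

```python
import socket
import threading

# Assumed dedicated port for the sanity check (not a XenServer port).
ROLE_PORT = 9099
MASTER, SLAVE = "master", "slave"

def serve_role(get_role, port=ROLE_PORT):
    """Answer every TCP connection with this host's current role."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", port))
    srv.listen(5)

    def loop():
        while True:
            try:
                conn, _ = srv.accept()
            except OSError:       # socket closed, stop serving
                return
            with conn:
                conn.sendall(get_role().encode())

    threading.Thread(target=loop, daemon=True).start()
    return srv

def ask_role(host, port=ROLE_PORT, timeout=2.0):
    """Ask a peer for its role over a plain TCP socket; None if unreachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            return s.recv(32).decode().strip() or None
    except OSError:
        return None

def find_conflicting_masters(my_role, peer_roles):
    """Pure decision logic: peers that claim mastership while we do too."""
    if my_role != MASTER:
        return []
    return [host for host, role in peer_roles.items() if role == MASTER]
```

A host believing itself master would call `ask_role()` for every known peer, feed the answers to `find_conflicting_masters()`, and fence itself (refuse to start VMs) if the list is non-empty. Because it deliberately bypasses the pool database, it keeps working even when pool synchronization is the thing that broke.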
A split brain must not occur under any circumstances, and this additional piece of code is a safety measure I would personally adopt even though circumventing XenAPI may not seem proper. It would prevent disasters to VM data, which is of the utmost importance.