What the HELL just happened to my XenServer (7.1) infrastructure??

Started by Gwyn Williams , 04 April 2017 - 05:02 PM
6 replies to this topic

Gwyn Williams Members

  • 410 posts

Posted 04 April 2017 - 05:02 PM

This afternoon, at 15:46, every single VM in my XenServer 7.1 infrastructure shut itself down.  Not all at the same time, though, but 5 at a time.  It began by going through all my 45 PVS XenApp servers, starting from #41 to #45, then working its way down, 5 at a time.  At first I thought it was exclusive to the PVS servers, and was therefore linked to a separate VLAN that we have for PVS network-boot traffic.  But after doing about 35 of the 45, it then began to shut the others down as well: the 50-ish other standalone servers that I have doing various bits.
 
It wasn't a graceful shutdown either - the usual green dot would change to the yellow dot, and during this time I would go to the console screen and the Windows logon screen would still show, but the Ctrl-Alt-Del button wouldn't do anything.  Then after about 5 mins in this state, the VM would turn to a proper powered-down state, at which point I'd be able to boot it back up.  At this point it would start that process on another set of 5 VMs. This happened on every VM across my 4 hosts, but happened only once for each VM.  It affected Windows servers, Windows desktops, Linux VMs, and XenServer virtual appliances alike.  There is no info about this in the event logs of the machines, which I think is because this was just a cold shutdown.
 
Only two things were changed around the time that this happened, and I struggle to see how they could be linked, but here goes...
 
We have a pair of Linux load balancer applications (Zen Load Balancer), which provide a load balanced IP address to sit in front of our two StoreFront servers.  But a problem developed with these about 6 months ago which I couldn't fix, so I resorted to DNS round robin as a replacement.  It wasn't until today (at exactly 15:46, as it happens) that I shut these VMs down.  Out of sheer desperation, after half of my servers had shut down, I powered these load balancers back up.  But it made no difference - the VMs kept on shutting down.
 
The other thing that was changed is that a colleague of mine installed the vSwitch Controller as a virtual appliance, at around 14:00 - but they didn't go for the option to 'take over' the current XenServer switching - we were going to do that out-of-hours in case it screwed something up.  So I don't think the vSwitch is actually doing much currently.
 
D'you know what I need to look out for in the XenServer logs, and which log file?  Thanks.

 



Alan Lantz Members

  • 6,910 posts

Posted 04 April 2017 - 05:56 PM

That's a good question and I hate the weird stuff. But it doesn't sound so much XenServer-initiated as maybe PVS-initiated. Do you use WLB? Maybe it went haywire. Do any of your VMs save logs to a persistent location? Maybe that could give you some insight. 

 

--Alan--



Tobias Kreidl CTP Member

  • 18,179 posts

Posted 04 April 2017 - 06:51 PM

Are all your servers properly synchronized to NTP?



Gwyn Williams Members

  • 410 posts

Posted 05 April 2017 - 12:09 PM

Checked NTP, and all seems well on that front - time is bang on.

 

We don't use WLB.

 

I thought PVS initially too, but unlikely now seeing as all the other non-PVS VMs were affected too.  I've checked the logs of several servers (PVS and non-PVS) but none have any info about what happened.

 

Is there anything at all about the aforementioned Load Balancer that could possibly have caused this?  To me it sounds very unlikely. However, when speaking to a colleague last night I learnt that this exact same thing happened about 2 months ago, when an upgrade of XenServer was attempted while I was not in the office.  What they did was first migrate everything off Host #1, but a couple of VMs didn't have XenServer Tools on them (the Zen Load Balancer being one of them, because it's an appliance, and one Windows 7 VM), so these two were shut down manually before beginning the XenServer upgrade process.  In the middle of the install the exact symptoms that I described in my OP began happening - servers shutting down, 5 at a time, starting from PVS #41 to #45, then working its way down to #1 to #5, then carrying on through all the non-PVS servers.

 

So I think it's too much of a coincidence.  Yesterday, I did power the load balancer back up for a while, and this didn't stop the shutdowns, but I think this was just XenServer queuing the job, and was only processing 5 at a time.

 

Is it in any way possible that a load balancer could cause this?

 

Thanks.



Alan Lantz Members

  • 6,910 posts

Posted 05 April 2017 - 01:37 PM

Odd. Sounds like you have narrowed it down, and I'm not familiar with your load balancer. But it's still odd that a load balancer would stop/restart VMs in groups of five. As far as XenServer queuing restart jobs goes, you should see that in notifications/events at the time it occurred. I restart VMs manually and rarely in quantity, so I don't know how the 7.x series handles multiple VM restart requests. So far I don't have any pools with a large VM density to test.
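
If you can pull the shutdown timestamps out of the events, something like this rough Python sketch would confirm whether they really cluster in fives. It's only an illustration: the `events` list is made-up sample data, `group_into_batches` is a name I invented, and the 60-second gap is an arbitrary guess you'd tune to what you actually see.

```python
from datetime import datetime, timedelta


def group_into_batches(events, max_gap=timedelta(seconds=60)):
    """Group (timestamp, vm_name) pairs into batches.

    A new batch starts whenever the gap since the previous event
    exceeds max_gap (an assumed threshold, not a XenServer setting).
    """
    batches = []
    for ts, name in sorted(events):
        if batches and ts - batches[-1][-1][0] <= max_gap:
            batches[-1].append((ts, name))
        else:
            batches.append([(ts, name)])
    return batches


# Hypothetical sample data: 12 VMs going down in waves 5 minutes apart.
t0 = datetime(2017, 4, 4, 15, 46)
events = [
    (t0 + timedelta(minutes=5 * (i // 5), seconds=i % 5), f"vm-{i:02d}")
    for i in range(12)
]

batches = group_into_batches(events)
for batch in batches:
    start = batch[0][0].strftime("%H:%M:%S")
    print(start, len(batch), [name for _, name in batch])
```

If the batch sizes come out as a steady run of 5 with a roughly constant gap between them, that would point at something working through a queue rather than the VMs failing independently.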

 

--Alan--



Tobias Kreidl CTP Member

  • 18,179 posts

Posted 06 April 2017 - 03:21 AM

Sounds suspicious. Did you run a CIS (Citrix Insight Services) health check? We don't use WLB, and if need be, have some scripts we developed in-house to balance things. WLB could do far more than the couple of things it does now. Intelligent scripts could do almost anything imaginable, minus the "neat" GUI front end, based on the plethora of available metrics.

 

-=Tobias



Gwyn Williams Members

  • 410 posts

Posted 21 April 2017 - 03:22 PM

I'm afraid we don't use Citrix Insight Services.  And we don't use WLB either.  Things have been OK since then, so now I'm reasonably sure that our problems on both occasions were due to the Zen Load Balancer.  Not wanting to repeat this, I've duly deleted my load balancer and have gone back to good old DNS round robin.

 

Now, Citrix need to REALLY pay attention to this, and they need to try to recreate this, because if this is true (and I'm about 80% certain), then this is an alarming back-door into XenServer that needs to be closed.
 
For the record, if they're reading this, it was the Zen Load Balancer v3.05 Community Edition.