XenServer 6.2sp1 hosts crashing

Started by Will Sours, 23 May 2014 - 04:20 PM
10 replies to this topic

Will Sours Members

Will Sours
  • 13 posts

Posted 23 May 2014 - 04:20 PM

I recently (and tediously) migrated all my VMs from a 6.0.2 pool on old hardware to a new 6.2 SP1 pool with all updates applied, which required a re-install of XenTools on nearly every VM that wasn't Windows Server 2003. All are on separate iSCSI volumes, so I just detached each from the old pool, attached it on the new one, created a new VM on the new pool, and attached the storage to it. I migrated them over a couple-week period and didn't have any real issues.

But in the few weeks since, I've had 4 instances where a host on the new pool showed as offline in XenCenter while its VMs were still running. A reboot of the host resolved it every time but one, when I also had to do an emergency network reset. The last couple of times I accessed the host consoles through the iDRACs to do some further troubleshooting first. The error on the console when this happens is "The underlying Xen API xapi is not running. This console will have reduced functionality. Would you like to attempt to restart xapi?" - I do, but it doesn't restart.

I googled this and it seems to normally be caused by the volume being full, but I'm only at about 63% full, and I also have plenty of free memory. The first time, I uploaded the pool logs to the Citrix log parser site and it said it detected an out-of-memory error (with no known cause), but like I said, the last couple of times the host showed plenty of free memory (free -m showed ~3GB or so free). The third time this happened I found this article (http://www.riverlite.co.uk/2014/01/the-underlying-xen-api-xapi-is-not-running-this-console-will-have-reduced-functionality/) and found the state.db file had an old timestamp from about a month ago (the same on another host). I renamed it and restarted xapi as suggested, but that didn't fix it either: it happened again a couple of days later, and that time the state.db file had the correct date.
I'm not a XS expert and am currently running the free version of XS (but am awaiting a quote to get it licensed), so I don't have support. Any ideas? My pool is running on new Dell R620s with the latest BIOS/firmware as of March; no extra drivers were installed during the XS install.
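For anyone landing here with the same symptom, the first checks from the dom0 console are the usual suspects (a sketch; xe-toolstack-restart is the XenServer helper script, guarded below so the snippet does nothing on a non-XenServer machine):

```shell
# Quick dom0 health checks when XenCenter shows the host offline.
df -h /      # xapi will not start if the root filesystem is full
free -m      # dom0 free memory; OOM kills are logged in /var/log/kern.log

# Try a clean toolstack restart before resorting to a host reboot:
if command -v xe-toolstack-restart >/dev/null 2>&1; then
    xe-toolstack-restart
fi
```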



Alan Lantz Members

Alan Lantz
  • 7,294 posts

Posted 23 May 2014 - 04:28 PM

What I have found regarding this is that it's a problem with your management interface. Do you have a bonded management interface? If so, go to just a single interface and see if your issue resolves; if it does, you have an issue between your bonding and your switches. If you are running a single interface, I would start digging into Ethernet errors and use Wireshark to see what's going on. This also assumes you have the correct NIC drivers installed for your Service Pack/hotfix version and that your hardware is up to date on firmware.
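As a starting point for digging into Ethernet errors, the kernel's per-interface counters can be read without any extra tools (a sketch using the standard /sys/class/net statistics files, available on any Linux including dom0):

```shell
# Print RX/TX error counters for every interface; a steadily climbing count
# on the management NIC points at cabling, the switch port, or the driver.
for dev in /sys/class/net/*/; do
    nic=$(basename "$dev")
    printf '%s rx_errors=%s tx_errors=%s\n' "$nic" \
        "$(cat "${dev}statistics/rx_errors")" \
        "$(cat "${dev}statistics/tx_errors")"
done
```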

 

Alan Lantz

SysAdmin

City of Rogers, AR



Will Sours Members

Will Sours
  • 13 posts

Posted 23 May 2014 - 04:35 PM

Thanks Alan. I'm currently running these with only 2 network interfaces on each host: a management/public network, and a SAN network. I saw a lot of issues in the past with bonded interfaces, so we just stuck with one for now. I'll go take a look at the switch ports etc. and see if there are issues, but it's happened on multiple hosts (3 out of 4, all but the pool master), and they're using the same switches as my old pool.



James Cannon Citrix Employees

James Cannon
  • 4,402 posts

Posted 23 May 2014 - 04:45 PM

Hi Will,

 

Can you clarify "new 6.2 sp1 pool with all updates applied from a 6.0.2 pool"? You have all patches on 6.0.2, but VMs are on 6.2 pool?

 

Do you have 6.2 patches applied (and any required device drivers) to the 6.2 hosts? Here is a document to guide you to ensure that your 6.2 hosts are up to date:

 

http://support.citrix.com/article/CTX138115



Will Sours Members

Will Sours
  • 13 posts

Posted 23 May 2014 - 05:11 PM

Hi James,

 

Sorry, I could have made that clearer. The XenServer 6.2 was a fresh install, and I applied SP1 and all hotfixes before moving any VMs to it. The only update I'm seeing in XenCenter is for XenCenter itself, and according to another forum post I read, that appears to be a bug. I didn't have to install any separate drivers, and I'm not using any of the devices in that CTX doc. A correction to my original post: the old pool was XenServer 6.2, not 6.0.2, without some of the latest hotfixes. It was on old HP blades and was stable.



James Cannon Citrix Employees

James Cannon
  • 4,402 posts

Posted 23 May 2014 - 05:19 PM

Hi Will,

 

Thank you for the clarification. Once you apply SP1, there should be updated device drivers. Here is a link to the document:

http://support.citrix.com/article/CTX139791

 

We can help to identify any needed drivers.

 

To find out which driver a NIC is using, we can use the Linux ethtool command. For example, for NIC 0 we would use:

ethtool -i eth0

 

Maybe you have an Emulex converged network adapter (CNA) card, which may have a firmware/driver mismatch. If you had an actual crash, you would have a crash dump file and the host icon would be different in XenCenter (indicating a host crash). What you are describing is loss of network or management (XAPI is not running). This means we don't have the service to communicate with, though we can likely still ping the hosts.

 

We also need to know which hotfixes you have applied for SP1, as hotfix 5 for SP1 has a different document for device drivers. We do not want to have a newer Xen kernel (from SP1) with drivers for the base 6.2 version, as this could cause a binary mismatch between driver and kernel.
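To spot that kind of binary mismatch, the running kernel can be compared against the kernel a driver module was built for (a sketch; ixgbe is just an example module name, and the fallback keeps it harmless on machines without it):

```shell
# A driver's vermagic string names the kernel it was built against;
# it should match the running kernel reported by uname -r.
uname -r
modinfo ixgbe 2>/dev/null | grep -i vermagic \
    || echo "ixgbe module not available on this machine"
```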



Will Sours Members

Will Sours
  • 13 posts

Posted 23 May 2014 - 05:26 PM

I'm using Intel NICs, which aren't in that CTX doc:

driver: ixgbe
version: 3.14.5
firmware-version: 0x800004cf, 15.0.28

 

And here are the specific hotfixes applied to all 4 hosts:

XS62E001
XS62E002
XS62E004
XS62E005
XS62E007
XS62E008
XS62E009
XS62E010
XS62E011
XS62E012
XS62E013
XS62E014
XS62ESP1
XS62ESP1002
XS62ESP1004
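For reference, the same list can be pulled from the CLI on any pool member (a sketch; xe patch-list is the XenServer 6.x command, guarded below so the snippet is a no-op elsewhere):

```shell
# List applied hotfixes as the pool sees them (run in dom0).
if command -v xe >/dev/null 2>&1; then
    xe patch-list params=name-label,hosts
else
    echo "xe not found: run this from a XenServer host console"
fi
```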


James Cannon Citrix Employees

James Cannon
  • 4,402 posts

Posted 23 May 2014 - 05:38 PM

Ok, so there is no need to update the NIC driver. :)

 

Looking at hotfix 5, there are a number of fixes:

http://support.citrix.com/article/CTX140553

 

 

  1. If the control domain (dom0) experiences CPU Soft Lockup issues, the host can fail and then restart. When this happens, the following message appears: BUG: soft lockup - CPU#0 stuck for 61s!.
  2. In rare cases, XenServer hosts can enter a deadlock due to a bug in the spin lock implementation and display the error message: BUG: soft lockup - CPU#0 stuck for 61s!.
  3. When certain dom0 resources, such as event channels or vmalloc space, are exhausted, plugging and then unplugging a virtual network interface (VIF) on a virtual machine (VM) can cause the host to fail.

 

The first 2 could be a cause for XAPI not running. You would have to examine the host's log files (/var/log/kern.log would be a good one to look at first).
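Grepping for those signatures is a quick first pass (a sketch; /var/log/kern.log is the XenServer 6.x location, and the existence check keeps it harmless on other machines):

```shell
# Pull recent soft-lockup / out-of-memory lines from the kernel log.
LOG=/var/log/kern.log
if [ -f "$LOG" ]; then
    grep -iE 'soft lockup|out of memory|oom' "$LOG" | tail -n 20
else
    echo "no $LOG on this machine"
fi
```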



Will Sours Members

Will Sours
  • 13 posts

Posted 23 May 2014 - 06:14 PM

I'm looking at the kern.log and not seeing either of those first two errors. I am seeing a huge burst of activity when my problem occurred; a ton of these OOMkills were logged:

 

May 22 18:01:01 DXS62-3 kernel: [4835275.703603] OOMkill: task 6575 (multipathd) got 0 points (base total_vm 1244, 0 children gave 0 points, cpu_time 5, runtime 4722, is_nice no, is_super yes, is_rawio yes, adj -16)
May 22 18:01:01 DXS62-3 kernel: [4835275.703645] OOMkill: task 6779 (ovsdb-server) got 3 points (base total_vm 1373, 1 children gave 731 points, cpu_time 25, runtime 4722, is_nice no, is_super yes, is_rawio yes, adj 0)

 

I see this right after those:

May 22 18:01:01 DXS62-3 kernel: [4835275.707462] mcelog.cron invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0

May 22 18:01:01 DXS62-3 kernel: [4835275.707479] Pid: 3579, comm: mcelog.cron Not tainted 2.6.32.43-0.4.1.xs1.8.0.835.170778xen #1
May 22 18:01:01 DXS62-3 kernel: [4835275.707493] Call Trace:
May 22 18:01:01 DXS62-3 kernel: [4835275.707513]  [<c0190a24>] oom_kill_process+0x144/0x290
May 22 18:01:01 DXS62-3 kernel: [4835275.707527]  [<c0190ff2>] __out_of_memory+0xd2/0x130
May 22 18:01:01 DXS62-3 kernel: [4835275.707542]  [<c01910ba>] out_of_memory+0x6a/0xc0
May 22 18:01:01 DXS62-3 kernel: [4835275.707556]  [<c019421f>] __alloc_pages_nodemask+0x56f/0x580
May 22 18:01:01 DXS62-3 kernel: [4835275.707617]  [<c01942fc>] __get_free_pages+0x1c/0x30
May 22 18:01:01 DXS62-3 kernel: [4835275.707633]  [<c0132668>] copy_process+0xa8/0x1030
May 22 18:01:01 DXS62-3 kernel: [4835275.707647]  [<c013370f>] do_fork+0x7f/0x390
May 22 18:01:01 DXS62-3 kernel: [4835275.707663]  [<c02ddf5d>] ? force_evtchn_callback+0xd/0x10
May 22 18:01:01 DXS62-3 kernel: [4835275.707690]  [<c0264713>] ? copy_to_user+0x43/0x60
May 22 18:01:01 DXS62-3 kernel: [4835275.707704]  [<c01026fb>] sys_clone+0x3b/0x50
May 22 18:01:01 DXS62-3 kernel: [4835275.707721]  [<c0104571>] syscall_call+0x7/0xb

 

When I checked the console, free -m showed I had plenty of free memory, but I guess that's after the OOM killer was called... I'm mostly a Windows guy so this is a little hard to read, but I don't see anything in here leading up to the OOM event that triggered this.



James Cannon Citrix Employees

James Cannon
  • 4,402 posts

Posted 23 May 2014 - 06:23 PM

Ok, so the OOMkill is an out of memory kill operation. What is interesting to me is a possible Machine Check Error:

 

May 22 18:01:01 DXS62-3 kernel: [4835275.707462] mcelog.cron invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0

May 22 18:01:01 DXS62-3 kernel: [4835275.707479] Pid: 3579, comm: mcelog.cron Not tainted 2.6.32.43-0.4.1.xs1.8.0.835.170778xen #1

 

 

MCEs are almost always hardware-related, and possibly fixed with a BIOS update (March is the latest firmware as noted on the Dell web page).

 

Once you see the OOMkill, the only option is to reboot. As this only affects dom0, you can use Remote Desktop to safely power off the guest VMs and, when they are no longer pingable, power-cycle the host. I wish there were another way around this. The VMs are in no danger, but we have lost command and control (via XAPI) and cannot migrate VMs off for the reboot. I am really sorry about that.

 

I would definitely recommend hotfix 5 for XenServer 6.2 SP1, as the OOMkills are number 3 in the document (vmalloc is virtual memory allocation).
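One more note on why free -m can look healthy right before an OOMkill: the XenServer 6.x dom0 runs a 32-bit kernel, so allocations can fail once low memory (or vmalloc space) is exhausted even with gigabytes reported free. A quick check (the Low* lines only appear on 32-bit kernels):

```shell
# On a 32-bit dom0, LowFree is the pool the kernel actually allocates from;
# on a 64-bit machine only the MemTotal/MemFree lines will match.
grep -iE 'memtotal|memfree|lowtotal|lowfree' /proc/meminfo
```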



Will Sours Members

Will Sours
  • 13 posts

Posted 23 May 2014 - 06:50 PM



OK, I'll try installing that tomorrow morning and cross my fingers... If there's anything else I should check, please let me know.

 

Thanks! I'll report back...