XenServer 7.1 guests i686 CPU error on all VMs

Started by Jonathan Bailey , 12 August 2017 - 10:52 AM
14 replies to this topic

Jonathan Bailey Members

Jonathan Bailey
  • 6 posts

Posted 12 August 2017 - 10:52 AM

We rebooted our servers and SAN just to change an iLO password. Xen has come back up fine and sees the SAN storage, but all the VMs only see an i686 CPU. Linux VMs report this when they boot; Windows VMs simply stop after POST.

 

Linux kernel message shows:

This kernel requires an x86-64 CPU, but only detected an i686 CPU. 
Unable to boot - please use a kernel appropriate for your CPU.
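
The message above means the virtual CPU presented to the guest does not advertise long mode (the `lm` flag). As a minimal sketch, the same check can be made from any Linux system that does boot (a 32-bit rescue VM, for example) by reading /proc/cpuinfo. The function names here are illustrative and not part of any XenServer tooling.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def supports_64bit(cpuinfo_text):
    """True if the 'lm' (long mode) flag is present, which x86-64 kernels require."""
    # Exact set membership, so 'lahf_lm' is not mistaken for 'lm'.
    return "lm" in cpu_flags(cpuinfo_text)
```

For example, `supports_64bit(open("/proc/cpuinfo").read())` returning False inside a guest would match the kernel error shown above.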

 

No other changes were made to the BIOS. Only the iLO was changed on the SAN.

 

Any ideas?

 



Alan Lantz Members

Alan Lantz
  • 7,442 posts

Posted 12 August 2017 - 01:41 PM

I would check the BIOS again to make sure you didn't disable virtualization. That's what it sounds like. A simple password change wouldn't keep VMs from booting.

 

--Alan--



Jonathan Bailey Members

Jonathan Bailey
  • 6 posts

Posted 12 August 2017 - 04:47 PM

I checked them, and Intel Virtualisation is set to enabled on all of them. This was also confirmed by running some commands on the host itself (e.g. checking /proc/cpuinfo).

 

I have now found "Pool CPU Features reduced" in the logs, though. Maybe this is the issue. All blades have Xeon E5440 processors, so I'm not sure why this has been triggered. Would it go as low as limiting the pool to i686?

 

Thanks for your reply!



Alan Lantz Members

Alan Lantz
  • 7,442 posts

Posted 12 August 2017 - 05:06 PM

I would make sure the BIOS is on the latest release. Maybe you have run into some sort of BIOS bug that disables virtualization. Past that, it would be going to HP (I'm assuming) to find out what's the deal.

 

--Alan--



Jonathan Bailey Members

Jonathan Bailey
  • 6 posts

Posted 12 August 2017 - 05:14 PM

Will check through that. They are very old G1 HP Blades though.

 

Are there any commands or tricks to get it to re-evaluate the CPU level? Or any way to check what it is currently limiting it to? Any details I find are from before Xen v7.

 

Thanks again for your help.



Alan Lantz Members

Alan Lantz
  • 7,442 posts

Posted 12 August 2017 - 05:18 PM

Not that I'm aware of. Most commands pre-XenServer 7.x will work with XenServer 7.x. There are exceptions but I would say 90% of the commands you find relating to XenServer apply equally to all versions.

 

--Alan--



Jonathan Bailey Members

Jonathan Bailey
  • 6 posts

Posted 12 August 2017 - 10:24 PM

Just an update, as I have now resolved the issue. One of the blades (hosts) wasn't happy somewhere with its hardware after the original reboot, although Xen did not report anything and the host did boot correctly. This made Xen downgrade the pool CPU level to 32-bit only, leaving all of our 64-bit VMs unable to boot. I imagined that having the other blades powered down would alter the CPU level, so I started with just the master. This is not the case: you need to actually remove the host from the pool, rather than just having it powered down, to affect the CPU level. As I didn't know which host was causing the issue, I removed hosts one by one until the pool told me that the CPU level had been increased. Voila, the VMs were all happily booting again!

 

I do not blame Xen; however, it could have given more useful information about which machine was causing it to downgrade the pool. I think this is just due to running older, less reliable hardware.

 

Thank you Alan for all your help.
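
The elimination procedure described in this post can be sketched as a small loop. This is only a model of the manual steps, not a XenServer API: `pool_is_64bit` is a hypothetical callback standing in for whatever check the operator uses (in this thread, the XenCenter message that the CPU level had increased), and the host names are made up.

```python
def find_offending_host(hosts, pool_is_64bit):
    """Eject non-master hosts one at a time until the pool level recovers.

    hosts: list of host names, master first (the master is never ejected).
    pool_is_64bit: callable taking the list of remaining hosts and returning
        True once the pool offers 64-bit CPUs again (hypothetical check).
    Returns (culprit, ejected): the host whose removal restored the level,
    and every host removed up to that point. culprit is None if the level
    never recovered.
    """
    remaining = list(hosts)
    ejected = []
    for candidate in hosts[1:]:
        # Powering a host down is not enough; it has to leave the pool.
        remaining.remove(candidate)
        ejected.append(candidate)
        if pool_is_64bit(remaining):
            return candidate, ejected
    return None, ejected
```

The hosts ejected before the culprit are presumably healthy and can be rejoined afterwards. The loop assumes a single offending host.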


Best Answer

Alan Lantz Members

Alan Lantz
  • 7,442 posts

Posted 13 August 2017 - 02:14 AM

That's an interesting solution. I know 7.x handles the cpu masking differently, but I never thought about a single host somehow bringing the whole pool level down. Haven't heard of that one before and glad to see you are up and running again.

 

--Alan--



Craig Falconer Members

Craig Falconer
  • 9 posts

Posted 25 August 2017 - 07:08 AM

This made Xen downgrade the CPU level to 32bit only causing all of our 64bit VMs to be unable to boot. ...

 

I have a bunch of Dell C5220 blades that ran XenServer 6.5 perfectly well. However, a clean install of 7.2 leaves them all unable to run 64-bit kernels. The CentOS 7 installer ISO is all it takes to test this.

 

What showed you which pool member was wrong?  I cannot find "CPU Features reduced" in any log file.

 

Don't want to kick pool members one at a time - I booted one out with "destroy" and it needed a reinstall from scratch, but once that was done and before  joining the pool it could run 64 bit kernels in a VM.

When joining it to the pool I get a message

 

You are attempting to add the server 'c5220-3-6' to a pool that is using older CPUs

VMs running in the pool will only use the CPU features common to all the servers in the pool.

 

 

So it's absolutely something in the pool that is hiding all the 64-bit CPUs.

 

Every host has identical CPUs too: Intel Xeon E3-1270 v2 at 3.5GHz, with 32 GB RAM.

 

How can I finger the bad node ?  Or is it nuke the whole pool from orbit and start over time ?



Alan Lantz Members
  • #10

Alan Lantz
  • 7,442 posts

Posted 25 August 2017 - 01:15 PM

I would think with host-cpu-info you should be able to compare settings and tell which one is different.

 

--Alan--
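
The comparison suggested above could be sketched as follows, assuming the output of `xe host-cpu-info` has been captured as text for each host (for example over SSH). The parser relies only on the "key : value" layout that the command prints; the function names are illustrative.

```python
def parse_host_cpu_info(text):
    """Parse the 'key : value' lines printed by `xe host-cpu-info` into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def differing_fields(hosts, fields=("flags", "features", "features_pv", "features_hvm")):
    """Return {field: {host: value}} for every field that is not identical on all hosts."""
    diffs = {}
    for field in fields:
        values = {name: info.get(field) for name, info in hosts.items()}
        if len(set(values.values())) > 1:
            diffs[field] = values
    return diffs


def missing_flags(hosts):
    """For each host, list the flags that some other host has but this one lacks."""
    flag_sets = {name: set(info.get("flags", "").split()) for name, info in hosts.items()}
    union = set().union(*flag_sets.values())
    return {name: sorted(union - flags) for name, flags in flag_sets.items() if union - flags}
```

Here `hosts` is a dict mapping a host name to its parsed output. A host that turns up in `missing_flags` with `lm` in its list would be the obvious suspect. This only helps if the bad host actually reports different values.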



Tobias Kreidl CTP Member
  • #11

Tobias Kreidl
  • 18,867 posts

Posted 25 August 2017 - 04:14 PM

Interesting... a Dell R730 Xeon v4 shows:

# xe host-cpu-info
cpu_count       : 56
    socket_count: 2
          vendor: GenuineIntel
           speed: 2400.056
       modelname: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
          family: 6
           model: 79
        stepping: 1
           flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ht syscall nx lm constant_tsc arch_perfmon rep_good nopl nonstop_tsc eagerfpu pni pclmulqdq monitor est ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm fsgsbase bmi1 hle avx2 bmi2 erms rtm rdseed adx xsaveopt cqm_llc cqm_occup_llc
        features: 7ffefbff-bfebfbff-00000121-2c100800
     features_pv: 17c9cbf5-f6f83203-2191cbf5-00000123-00000001-000c0b39-00000000-00000000-00000000
    features_hvm: 17cbfbff-f7fa3223-2d93fbff-00000123-00000001-001c0fbb-00000000-00000000-00000000

 

and a Dell R720 Xeon v2 shows:

# xe host-cpu-info
cpu_count       : 40
    socket_count: 2
          vendor: GenuineIntel
           speed: 2800.074
       modelname: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
          family: 6
           model: 62
        stepping: 4
           flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ht syscall nx lm constant_tsc arch_perfmon rep_good nopl nonstop_tsc eagerfpu pni pclmulqdq monitor est ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c rdrand hypervisor lahf_lm ida arat epb pln pts dtherm fsgsbase erms xsaveopt
        features: 7fbee3ff-bfebfbff-00000001-2c100800
     features_pv: 17c9cbf5-f6b82203-2191cbf5-00000003-00000001-00000201-00000000-00000000-00000000
    features_hvm: 17cbfbff-f7ba2223-2d93fbff-00000003-00000001-00000281-00000000-00000000-00000000

 

so I'm not sure how the host-cpu-info could be helpful in this case, in particular because these hosts work just great in this pool, yet have quite different characteristics and metrics.

 

-=Tobias
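
One way to read the two outputs above: as far as I understand XenServer 7's CPU levelling, the pool level is the set of features common to all hosts, i.e. a word-by-word bitwise AND of the per-host masks. The sketch below applies that to the two features_hvm values quoted above. Both the intersection model and the position of the long-mode bit (bit 29 of the third word, taken to be CPUID 0x80000001 EDX) are assumptions, not something confirmed in this thread.

```python
LM_WORD, LM_BIT = 2, 29  # assumed location of the long-mode bit in the mask


def parse_mask(mask):
    """Split 'xxxxxxxx-xxxxxxxx-...' into a list of 32-bit integers."""
    return [int(word, 16) for word in mask.split("-")]


def intersect(masks):
    """Word-by-word bitwise AND of several feature masks."""
    parsed = [parse_mask(m) for m in masks]
    result = parsed[0]
    for words in parsed[1:]:
        result = [a & b for a, b in zip(result, words)]
    return result


def format_mask(words):
    """Render a list of 32-bit words back into the dashed hex form."""
    return "-".join(f"{w:08x}" for w in words)


def has_long_mode(words):
    """True if the assumed long-mode bit is set in the mask."""
    return bool((words[LM_WORD] >> LM_BIT) & 1)


# features_hvm values copied from the two host-cpu-info outputs above
r730 = "17cbfbff-f7fa3223-2d93fbff-00000123-00000001-001c0fbb-00000000-00000000-00000000"
r720 = "17cbfbff-f7ba2223-2d93fbff-00000003-00000001-00000281-00000000-00000000-00000000"

common = intersect([r730, r720])
print(format_mask(common))
print(has_long_mode(common))
```

Under these assumptions the common mask is simply the older R720's mask, and it still carries the long-mode bit. That would explain why hosts with quite different CPUs can share a pool without trouble. A pool stuck at 32-bit suggests one member is contributing a mask with that bit cleared.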



Alan Lantz Members
  • #12

Alan Lantz
  • 7,442 posts

Posted 25 August 2017 - 06:41 PM

Interesting. If the flags are the same in a pool that's been downgraded, maybe enable messages to see if something hits that log? Odd that you have to keep kicking slaves out of a pool until you run across the right one. It has to be logged somewhere.

 

--Alan--



Jonathan Bailey Members
  • #13

Jonathan Bailey
  • 6 posts

Posted 29 August 2017 - 02:49 PM

I simply had to remove nodes from the pool until I found the offending one. I was glad to find the culprit early on.

 

The only message I had was in XenCenter, telling me that the CPU level had increased. I couldn't get any of the older commands to work on Xen v7 or newer, so I was stuck finding which node caused it by elimination. In my case all the CPUs were the same series of Xeons, just a couple with different clock speeds, although I think the issue was the blade having failing hardware.

 

HTH



Craig Falconer Members
  • #14

Craig Falconer
  • 9 posts

Posted 02 September 2017 - 06:07 AM

I would think with host-cpu-info you should be able to compare settings and tell which one is different.

 

--Alan--

 

Yeah I thought so too.... but every node reported the same flags.

 

In the end I was disassembling the pool and found the bad node - I'd managed to live-migrate 7 production VMs into the pool using --force from the CLI on another pool - they were all running 64 bit but probably would have failed to reboot!

 

When I right-clicked a running VM to migrate it, the live-migrate menu showed all 12 hosts in the pool, and node9 claimed

"the host does not have some of the CPU features that the VM is currently using"

 

https://criggie.org.nz/pictures/xen.png

 

As soon as I kicked node9 from the pool, then all was well and 64 bit VMs worked fine.

 

Curiously, 64 bit VMs worked perfectly on the node once it was out of the pool too.  So something got confused and carried through.

 

I did a fresh XenServer 7.2 install on node9 and it worked fine as well, and it is now a member of a pool again like it should be.



Alan Lantz Members
  • #15

Alan Lantz
  • 7,442 posts

Posted 02 September 2017 - 01:19 PM

Thanks for that update. That's a very odd bug; I hope they fix it soon.

 

--Alan--