XenServer 6.2 Crash - BUG?

Started by MrDigit , 21 July 2013 - 09:00 PM
54 replies to this topic

Best Answer MrDigit , 02 May 2014 - 08:37 AM

Hello,

First, a big thanks to Andrew for all his efforts! My problem was solved by disabling MSI on my Areca RAID controller.

MSI is enabled only with the current kernel driver from the Areca website, not with the kernel module shipped with XenServer.

I have added

options arcmsr msix_enable=0

into /etc/modprobe.conf to disable MSI.

Best regards

  MrDigit
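For anyone who wants to double-check that the option from the best answer really ended up in the file, here is a minimal Python sketch that reads `options` lines from modprobe.conf-style text. The helper name `parse_modprobe_options` is made up for illustration; it only parses text and does not tell you whether the loaded driver actually honours the setting.

```python
def parse_modprobe_options(text):
    """Collect `options <module> key=value ...` lines into a nested dict."""
    options = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        fields = line.split()
        if len(fields) < 3 or fields[0] != "options":
            continue  # not an options line, or no parameters given
        module, params = fields[1], fields[2:]
        for param in params:
            key, _, value = param.partition("=")
            options.setdefault(module, {})[key] = value
    return options


if __name__ == "__main__":
    # The single line quoted in the post above.
    print(parse_modprobe_options("options arcmsr msix_enable=0\n"))
```

On a real host you would pass in the contents of /etc/modprobe.conf instead of the one-line string.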

MrDigit Members

Thimo Eichstädt
  • 30 posts

Posted 21 July 2013 - 09:00 PM

Hello all,

I've been using XenServer 6.0.2 for a long time without problems. Last week I migrated to XenServer 6.2.
Today XenServer 6.2 crashed and restarted; attached you'll find the log files from the crash directory.

Is it a bug, or did I do something wrong?

Best regards
MrDigit

Here is the output of xen.log:

(XEN) __csched_vcpu_acct_start: setting dom 0 as the privileged domain
(XEN) ----[ Xen-4.1.5 x86_64 debug=n Not tainted ]----
(XEN) CPU: 1
(XEN) RIP: e008:[<ffff82c4801561c0>] context_switch+0x850/0xe80
(XEN) RFLAGS: 0000000000010282 CONTEXT: hypervisor
(XEN) rax: 0000000000000000 rbx: ffff8300cb6ec280 rcx: 0000000000000000
(XEN) rdx: 000000000000002c rsi: ffff83081f02fd2c rdi: ffff8300cb6ec000
(XEN) rbp: ffff8300cb6ec000 rsp: ffff83081f02fd38 r8: ffff83081f038c90
(XEN) r9: 0000000000000002 r10: 00024b9329caa068 r11: ffff82c4801a7c90
(XEN) r12: ffff8300b8586000 r13: ffff82c4802ce1a0 r14: ffff83081f038060
(XEN) r15: 0000ad25c7e6a33b cr0: 000000008005003b cr4: 00000000001026f0
(XEN) cr3: 0000000808bd9000 cr2: 00000000000f0451
(XEN) ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0000 cs: e008
(XEN) Xen stack trace from rsp=ffff83081f02fd38:
(XEN) ffff82c480155b40 ffff82c480155b40 0000000000000000 0000000000000002
(XEN) 0000000000000000 0000000000000000 0000000000000000 ffff8305d892a000
(XEN) ffff83081f004000 ffff8300cb6ec000 ffff83081f005580 0000ad25c7e69ced
(XEN) ffff82c4802ea4c0 ffff82c4802ce1a0 0000000000000082 ffff82c4802ea4c0
(XEN) 0000ad25c7e6a33b ffff83081f02fe70 ffff83081f0b4c60 ffff830808be8d20
(XEN) 0000000001c9c380 0000000000000000 ffff8300cb6ec000 0000000000000000
(XEN) 0000000000000000 0000000000000000 0000000000000000 0000ad25c87f3f09
(XEN) ffff83081f004000 ffff8300cb6ec000 ffff8300b8586000 ffff83081f005580
(XEN) ffff83081f038060 0000ad25c7e6a33b ffff82c480122421 ffff83081f022010
(XEN) 0000000000000001 ffff83081f038040 ffff83081f038040 ffff8300cb6ec000
(XEN) 0000000001c9c380 ffff8300cb6ec000 ffff82c48014eeac ffff83081f03edd8
(XEN) ffff82c480126854 ffff82c4801b68d6 0000000000000001 ffffffffffffffff
(XEN) ffff83081f02ff18 ffff82c4802b7580 ffff82c4802bf580 0000ad25c7ce9c5d
(XEN) ffff82c4801240f5 ffff83081f02ff18 ffff8300b8586000 ffff8300b858a000
(XEN) ffff82c4802ce1a0 ffff83081f038060 ffff82c480154905 0000000000000001
(XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) 00000000ee8a3f8c 0000000000000001 0000000000000000 0000000000000000
(XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) 00000000ee8a3f74 000000000000ad25 0000000000000001 0000010000000000
(XEN) 00000000c01013a7 0000000000000061 0000000000000246 00000000ee8a3f70
(XEN) Xen call trace:
(XEN) [<ffff82c4801561c0>] context_switch+0x850/0xe80
(XEN) 0[<ffff82c480155b40>] context_switch+0x1d0/0xe80
(XEN) 1[<ffff82c480155b40>] context_switch+0x1d0/0xe80
(XEN) 34[<ffff82c480122421>] schedule+0x2c1/0x820
(XEN) 42[<ffff82c48014eeac>] reprogram_timer+0xbc/0xd0
(XEN) 44[<ffff82c480126854>] timer_softirq_action+0x154/0x220
(XEN) 45[<ffff82c4801b68d6>] hvm_vcpu_has_pending_irq+0x76/0xd0
(XEN) 52[<ffff82c4801240f5>] __do_softirq+0x65/0x90
(XEN) 58[<ffff82c480154905>] idle_loop+0x25/0x50
(XEN)
(XEN) Pagetable walk from 00000000000f0451:
(XEN) L4[0x000] = 000000042f899027 0000000000011083
(XEN) L3[0x000] = 000000042f898027 0000000000011082
(XEN) L2[0x000] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) **************************************
(XEN) Panic on CPU 1:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0002]
(XEN) Faulting linear address: 00000000000f0451
(XEN) **************************************
(XEN)
(XEN) Reboot in five seconds...
(XEN) Executing crash image
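As a reading aid for the panic above, here is a small Python sketch that decodes the x86 page-fault error code and splits the faulting address into page-table indices. The bit meanings are the standard x86 ones; the function names are only illustrative and this is not part of any Xen tooling.

```python
# (mask, meaning when set, meaning when clear) for the x86 page-fault error code
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "supervisor mode"),
    (0x08, "reserved bit set in a page-table entry", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error_code(code):
    """Return a list of human-readable facts encoded in the error code."""
    parts = []
    for mask, if_set, if_clear in PF_BITS:
        text = if_set if code & mask else if_clear
        if text:
            parts.append(text)
    return parts


def pagetable_indices(addr):
    """Split a 48-bit linear address into L4..L1 indices and the page offset."""
    return {
        "L4": (addr >> 39) & 0x1FF,
        "L3": (addr >> 30) & 0x1FF,
        "L2": (addr >> 21) & 0x1FF,
        "L1": (addr >> 12) & 0x1FF,
        "offset": addr & 0xFFF,
    }


if __name__ == "__main__":
    print(decode_pf_error_code(0x0002))
    # → ['page not present', 'write access', 'supervisor mode']
    print(pagetable_indices(0x00000000000F0451))
    # → {'L4': 0, 'L3': 0, 'L2': 0, 'L1': 240, 'offset': 1105}
```

So error_code=0002 reads as a supervisor-mode write to a page that is not present, and the L4/L3/L2 indices of 0 match the entries shown in the pagetable walk, where the L2 entry is empty.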




Tobias Kreidl CTP Member

Tobias Kreidl
  • 18,718 posts

Posted 21 July 2013 - 09:24 PM

Are you running the latest BIOS/firmware updates? Is your hardware all listed as supported under XS 6.2 on the _HCL_?
--Tobias



MrDigit Members

Thimo Eichstädt
  • 30 posts

Posted 21 July 2013 - 09:49 PM

Hello,

thanks for your response.

The latest BIOS/firmware is installed. It is a Haswell chipset mainboard, quite new, but Haswell should be supported by XenServer 6.2. The Intel NICs also run fine with the e1000e driver. I can't find my Areca storage adapter in the HCL.

I have already checked the memory for some hours with memtest86+.

Best regards
Thimo



Tobias Kreidl CTP Member

Tobias Kreidl
  • 18,718 posts

Posted 21 July 2013 - 10:30 PM

It seems these errors have come and gone several times over the last several years, such as the one reported _here_.
--Tobias



Andrew Cooper Citrix Employees

Andrew Cooper
  • 77 posts

Posted 22 July 2013 - 10:25 AM

Sigh.

We have seen similar before, but I thought we had fixed all occurrences. It appears not.

Can you run "rpm -qa | grep xen" so I can confirm the exact version of Xen you are using?



MrDigit Members

Thimo Eichstädt
  • 30 posts

Posted 22 July 2013 - 10:42 AM

Hello Andrew,

thanks for your response. I am running the current XenServer 6.2 with all available patches applied. Here is the rpm output:

kernel-xen-2.6.32.43-0.4.1.xs1.8.0.835.170778
xen-device-model-1.8.0-84.7551
md3000-rdac-modules-xen-2.6.32.43-0.4.1.xs1.8.0.835.170778-09.03.0C05.0642-1120
firmware-qlogic-netxen-4.0.588-1
xapi-xenopsd-0.2-5669
xen-hypervisor-4.1.5-1.8.0.578.23791
xen-firmware-4.1.5-1.8.0.572.23774
xenserver-lsb-4.0-2.1.4.xs1120
openvswitch-modules-xen-2.6.32.43-0.4.1.xs1.8.0.835.170778-1.4.6-141.9924
xapi-xenops-0.2-5669
splashy-graphics-xenserver-0.3.9-xs1120
xenserver-transfer-vm-6.2.0-70314c
xen-tools-4.1.5-1.8.0.578.23791
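In case it helps others compare versions, here is a rough Python sketch that splits `rpm -qa` style lines into name, version and release. It assumes the default name-version-release output without an architecture suffix, as in the list above; `split_nvr` and `find_package` are hypothetical helper names.

```python
def split_nvr(package):
    """Split 'name-version-release' on the last two hyphens."""
    name, version, release = package.rsplit("-", 2)
    return name, version, release


def find_package(lines, wanted_name):
    """Return (version, release) of the first package with that exact name."""
    for line in lines:
        line = line.strip()
        if line.count("-") < 2:
            continue  # not in name-version-release form
        name, version, release = split_nvr(line)
        if name == wanted_name:
            return version, release
    return None


if __name__ == "__main__":
    rpm_output = [
        "xen-hypervisor-4.1.5-1.8.0.578.23791",
        "xen-tools-4.1.5-1.8.0.578.23791",
    ]
    print(find_package(rpm_output, "xen-hypervisor"))
    # → ('4.1.5', '1.8.0.578.23791')
```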

Best regards
MrDigit



Andrew Cooper Citrix Employees

Andrew Cooper
  • 77 posts

Posted 22 July 2013 - 12:02 PM

Thanks,

I have raised a bug and am looking into it right now.



Andrew Cooper Citrix Employees

Andrew Cooper
  • 77 posts

Posted 22 July 2013 - 01:05 PM

Ok,

Good and bad news. I think I have worked out what is going on, but it is not obvious what the cause is.

context_switch+0x1d0/0xe80 is a function pointer call to vcpu->arch.schedule_tail
context_switch+0x850/0xe80 is a misaligned instruction, which suggests a bad function call.

At a guess, I suspect a race condition involving a use-after-free on the vcpu information, where the function pointer has been overwritten.
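To make the two context_switch entries easier to compare, here is a small Python sketch that pulls `[<address>] symbol+offset/size` entries out of a Xen log and works out where each function starts. The regular expression and the helper name `parse_frames` are assumptions based only on the log format quoted in this thread.

```python
import re

# Matches e.g. "[<ffff82c4801561c0>] context_switch+0x850/0xe80"
FRAME_RE = re.compile(r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


def parse_frames(log_text):
    """Return one dict per call-trace entry found in the log text."""
    frames = []
    for match in FRAME_RE.finditer(log_text):
        addr, symbol, offset, size = match.groups()
        addr, offset, size = int(addr, 16), int(offset, 16), int(size, 16)
        frames.append({
            "symbol": symbol,
            "addr": addr,
            "offset": offset,
            "size": size,
            "start": addr - offset,  # address where the function begins
        })
    return frames


if __name__ == "__main__":
    sample = (
        "(XEN) [<ffff82c4801561c0>] context_switch+0x850/0xe80\n"
        "(XEN) 0[<ffff82c480155b40>] context_switch+0x1d0/0xe80\n"
    )
    for frame in parse_frames(sample):
        print(frame["symbol"], hex(frame["offset"]), hex(frame["start"]))
```

Both entries resolve to the same function start (0xffff82c480155970), so they do point into the same context_switch. Whether an offset lands on an instruction boundary cannot be seen from the log alone; that needs a disassembly of the matching Xen build.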

Attached is Xen with full debugging enabled. Copy both files into /boot/ and tweak the bootloader to boot this instead of xen.gz.

Can you also edit /etc/sysconfig/kdump and add "--dom0-symtab" to XCA_CMDLINE_EXTRA ?

This will cause the crashdump analyser to hexdump each struct vcpu and struct domain, so in the case of a crash the relevant function pointers can be verified. I am however hoping that the full debug version of Xen will catch the error rather closer to its source.




MrDigit Members

Thimo Eichstädt
  • 30 posts

Posted 22 July 2013 - 01:46 PM

Hello Andrew,

thanks a lot for your really fast response. I'll install the new version today, reboot the server and will come back to you when the hypervisor crashes again.

Hopefully the performance impact of the debug version is not too high.

Best regards
Thimo



Andrew Cooper Citrix Employees
  • #10

Andrew Cooper
  • 77 posts

Posted 22 July 2013 - 01:48 PM

Unless you are running microbenchmarks, you shouldn't notice a difference in performance.



Andrew Cooper Citrix Employees
  • #11

Andrew Cooper
  • 77 posts

Posted 22 July 2013 - 02:20 PM

I realise I have been a complete idiot with the suggestion for the crashdump analyser.

The extra command line parameter should be "--dump-structures", not "--dom0-symtab".

You can update /etc/sysconfig/kdump and then run "/etc/init.d/kdump start" to reload the crash environment with the correct command line arguments without rebooting.
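If you would rather script the edit than do it by hand, here is a minimal Python sketch that appends a flag to a shell-style variable assignment. It assumes the file holds a line of the form XCA_CMDLINE_EXTRA="..." (the exact layout of /etc/sysconfig/kdump is not shown in this thread), and `add_flag` is a made-up helper that works on text only.

```python
import re


def add_flag(config_text, variable, flag):
    """Append `flag` to a shell-style VARIABLE="..." assignment, at most once."""
    pattern = re.compile(
        r'^(%s=)"?([^"\n]*)"?[ \t]*$' % re.escape(variable), re.MULTILINE
    )

    def repl(match):
        values = match.group(2).split()
        if flag not in values:
            values.append(flag)
        return '%s"%s"' % (match.group(1), " ".join(values))

    new_text, count = pattern.subn(repl, config_text)
    if count == 0:
        # Variable not present yet: add a new assignment at the end.
        sep = "" if (not config_text or config_text.endswith("\n")) else "\n"
        new_text = config_text + sep + '%s="%s"\n' % (variable, flag)
    return new_text


if __name__ == "__main__":
    before = 'XCA_CMDLINE_EXTRA=""\n'
    print(add_flag(before, "XCA_CMDLINE_EXTRA", "--dump-structures"))
```

Write the result back to the file yourself and keep a backup copy; the sketch deliberately does not touch the disk.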



MrDigit Members
  • #12

Thimo Eichstädt
  • 30 posts

Posted 22 July 2013 - 03:57 PM

Hello Andrew,

OK, thank you for the correction. I've done as you told me, and the debug version is running. Now I'll wait.

Best regards
Thimo



MrDigit Members
  • #13

Thimo Eichstädt
  • 30 posts

Posted 27 July 2013 - 08:34 AM

Hello,

Last night XenServer crashed again, but unfortunately it did not generate a crash dump. I can't see anything in the log files except the normal kernel boot messages.

I've checked the kdump settings again, but they look fine:

>Jul 27 10:07:34 localhost kdump: Setting up crash kernel:
>Jul 27 10:07:34 localhost kdump: Crash kernel: /boot/vmlinuz-2.6.32.43-0.4.1.xs1.8.0.835.170778kdump
>Jul 27 10:07:34 localhost kdump: Crash ramdisk: /boot/initrd-2.6.32.43-0.4.1.xs1.8.0.835.170778kdump.img
>Jul 27 10:07:34 localhost kdump: Crash kernel command line: root=LABEL=root-logtaqlb ro console=tty0 quiet vga=785 splash kdump-xenversion=4.1.5.debug kdump-linuxversion=2.6.32.43-0.4.1.xs1.8.0.835.170778xen irqpoll maxcpus=1 reset_devices no-hlt
>Jul 27 10:07:34 localhost kdump: Loaded crash kernel
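To confirm at a glance which Xen build the crash kernel was set up for, here is a short Python sketch that splits a kernel command line like the one logged above into key/value pairs. `parse_cmdline` is an illustrative helper, not part of kdump.

```python
def parse_cmdline(cmdline):
    """Map key=value tokens to strings and bare flags to True."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")  # split on the first '=' only
        params[key] = value if sep else True
    return params


if __name__ == "__main__":
    cmdline = (
        "root=LABEL=root-logtaqlb ro console=tty0 quiet vga=785 splash "
        "kdump-xenversion=4.1.5.debug "
        "kdump-linuxversion=2.6.32.43-0.4.1.xs1.8.0.835.170778xen "
        "irqpoll maxcpus=1 reset_devices no-hlt"
    )
    params = parse_cmdline(cmdline)
    print(params["kdump-xenversion"])   # → 4.1.5.debug
    print(params["root"])               # → LABEL=root-logtaqlb
```

The kdump-xenversion value of 4.1.5.debug suggests the crash environment was loaded against the debug hypervisor rather than the stock one.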

So I'll have to wait until the next crash.

MrDigit



MrDigit Members
  • #14

Thimo Eichstädt
  • 30 posts

Posted 27 July 2013 - 08:44 AM

Or could there be something wrong with the Xen 4.1.5 debug version?



Andrew Cooper Citrix Employees
  • #15

Andrew Cooper
  • 77 posts

Posted 27 July 2013 - 09:51 AM

That debug version of Xen is using the same set of debugging code which we develop most of a project with, so I doubt that it is problematic, although the fact it causes the crash logs not to happen is curious.

Did it reboot after the crash, or did you have to manually power cycle?

Are you able to attach a serial connection?



MrDigit Members
  • #16

Thimo Eichstädt
  • 30 posts

Posted 27 July 2013 - 10:46 AM

Hello Andrew,

It rebooted automatically. I can't tell whether it rebooted into the crash image or directly through the BIOS into the normal Xen kernel again.

A serial connection will be somewhat difficult because the board doesn't have a native serial port.
Serial through USB would be possible, but I am not sure whether that works in the crash kernel.

Best regards
Thimo



Andrew Cooper Citrix Employees
  • #17

Andrew Cooper
  • 77 posts

Posted 28 July 2013 - 06:37 PM

Hi,

While newer versions of Xen do have support for USB debug consoles, Xen 4.1 does not. I have no idea whether Linux 2.6.32 would be able to cope or not.

As for the automatic reboot, I was looking for confirmation that it didn't wedge. That suggests that the first crash turned into a cascade fault.

Just for sanity's sake, can you check you have an up-to-date BIOS and system firmware? (I realise this is starting to clutch at straws.)



MrDigit Members
  • #18

Thimo Eichstädt
  • 30 posts

Posted 28 July 2013 - 11:08 PM

Hello,

The BIOS is the latest one, and the firmware of the RAID controller is the latest, too.

I am really a little helpless at the moment. What do you mean by cascade fault? Do you think it's a hardware problem, or is that your straw? :)

I am now trying to get a Linux-compatible serial PCI card, and then I'll try to get the serial console working.

Do you have any other ideas? This is an Intel Haswell board; do you already have experience with the Haswell chipset?

Best regards
Thimo



MrDigit Members
  • #19

Thimo Eichstädt
  • 30 posts

Posted 29 July 2013 - 07:04 AM

Hello,

OK, another crash occurred. This time the system was unresponsive: no ping, no keyboard input possible, and only a black, blank screen. I did not see anything, which is very mysterious. I don't know if all the crashes have the same source.

I did not try SysRq (my fault), but the Num Lock LED did not work either.
I also did not know that the Xen watchdog timeout is 5 minutes; I thought it was something like 5 seconds, so I did a hard reboot before the 5-minute timeout expired.

Best regards
Thimo



Andrew Cooper Citrix Employees
  • #20

Andrew Cooper
  • 77 posts

Posted 29 July 2013 - 09:39 AM

You can safely reduce the watchdog timeout to 5 seconds, and indeed I have done so on trunk recently. The 300 seconds is somewhat of a legacy from the time when the watchdog was introduced.

From your previous post, I see you are using a Haswell box. I am not aware of any particular bugs, but we have not verified its functionality yet; we are still trying to get some production hardware, as it will be sufficiently different from the pre-production hardware we currently have.