Problem with MPP drivers: Cmnd-failed try alt ctrl

Started by Pekka Panula, 08 February 2011 - 08:08 AM
6 replies to this topic

Pekka Panula Members

Pekka Panula
  • 109 posts

Posted 08 February 2011 - 08:08 AM

We are running XenServer 5.6.0 FP1 on IBM HS22V blade servers with QLogic QMI2572 adapters and IBM DS4700 storage. The system is multipathed.

Every night, at about the same time, we get errors like this:
494 [RAIDarray.mpp]Blade:1:0:1 Cmnd-failed try alt ctrl 0. vcmnd SN 18122322 pdev H0:C0:T1:L1 0x05/0x94/0x01 0x08000002 mpp_status:1

There are several of those messages, and it sometimes seems to switch to the secondary storage controller.
mppUtil -g0 shows that everything is OK.

I have contacted IBM support, but at the moment they are clueless, and because 5.6.0 FP1 is not an IBM-supported OS we only get limited support. They have looked at our configuration and storage support data and say everything should be OK. So both IBM and I think this could be some sort of driver problem: the QLogic Linux driver, the MPP driver, or something kernel related. We don't actually know why the errors occur. There are no errors in daily use, even though I have tried to generate lots of IOPS with a performance VM. There are also no FC errors on the storage controllers; IBM has checked those.

I have tried a different QLogic driver (an older version, 8.03.00.14.11.0-k4) and a newer MPP driver, but neither helped.

The storage system and the QLogic HBAs are both running the latest firmware that IBM offers.

We are clueless... We have an older HS21XM blade server running XenServer 5.6.0 with the MPP driver, and there are no errors there.



Marcos Silva Members

Marcos Silva
  • 130 posts

Posted 08 February 2011 - 02:03 PM

Hi Pekka,

I'm facing the same problem. In the previous version of XenServer we needed to disable AVT on the storage side and follow CTX125403 to enable MPP support.
With 5.6 FP1 we only need to run the command # /opt/xensource/libexec/mpp-rdac --enable and restart the server.
My best guess is that the cause of our problem is the mpp.conf file.
In earlier versions we needed to change the values of DisableLUNRebalance and FailbackToCurrentAllowed in mpp.conf.
mpp-rdac --enable changes only the DisableLUNRebalance value.
I want to change the other value (FailbackToCurrentAllowed) and see what happens, but I can't do it right now because there are a lot of users on this environment.
We are planning to make this change this weekend.
Can you change this value, restart the servers, and tell me before this weekend whether the problem is gone?

The values must be:

DisableLUNRebalance=3
FailbackToCurrentAllowed=0
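
A minimal sketch for checking those two values before rebooting, in Python. It assumes mpp.conf is a plain key=value file with # comments and that it lives at /etc/mpp.conf; both are assumptions, so adjust them to your install. The helper names are made up for illustration.

```python
# Expected values, taken from the post above.
EXPECTED = {"DisableLUNRebalance": "3", "FailbackToCurrentAllowed": "0"}


def parse_conf(text):
    """Parse key=value lines; anything after '#' is treated as a comment (assumption)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def mismatches(settings, expected=EXPECTED):
    """Return {key: (found, wanted)} for every expected key that differs or is missing."""
    return {
        key: (settings.get(key), wanted)
        for key, wanted in expected.items()
        if settings.get(key) != wanted
    }


def check_file(path="/etc/mpp.conf"):  # path is an assumption
    """Read the file at path and report which expected values differ."""
    with open(path) as handle:
        return mismatches(parse_conf(handle.read()))
```

An empty result from check_file() would mean both values already match; anything else lists what was found next to what was expected.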

Regards,

Marcos Silva



Pekka Panula Members

Pekka Panula
  • 109 posts

Posted 15 February 2011 - 08:33 AM

We still have the problem with IBM HS22V + DS4700.
I installed 5.6.0 FP1 using the multipath option, and it was using DMP.

Last night the kernel log was full of I/O errors. So there are some mysterious I/O errors, and it does not matter whether we use MPP or DMP. IBM has checked that the configuration should be OK.

On the storage side the hosts have been marked as LNXCLUSTER type, which should have AVT, etc. disabled.

IBM thinks this might be some sort of qla2xxx driver problem. I have tried downgrading it to an older version, but that did not help.

I was just checking the firmware: 5.03.09 is the current version. I checked whether there is a newer one, and ftp://ftp.qlogic.com/outgoing/linux/firmware/ shows that a newer one is available, but I think firmware loading does not work at all, because on boot the kernel says:

kernel: qla2xxx 0000:24:00.0: Firmware image unavailable.
kernel: qla2xxx 0000:24:00.0: Firmware images can be retrieved from: ftp://ftp.qlogic.com/outgoing/linux/firmware/.
kernel: qla2xxx 0000:24:00.0: FW: Loading from flash (20000)...
kernel: qla2xxx 0000:24:00.0: Allocated (64 KB) for FCE...
kernel: qla2xxx 0000:24:00.0: Allocated (64 KB) for EFT...
kernel: qla2xxx 0000:24:00.0: Allocated (1350 KB) for firmware dump...
kernel: scsi0 : qla2xxx
kernel: qla2xxx 0000:24:00.0: LOOP UP detected (4 Gbps).
kernel: qla2xxx 0000:24:00.0:
kernel: QLogic Fibre Channel HBA Driver: 8.03.03.11.5.7-k0
kernel: QLogic QMI2572 - QLogic 4Gb Fibre Channel Expansion Card (CIOv) for IBM BladeCenter
kernel: ISP2532: PCIe (5.0GT/s x4) @ 0000:24:00.0 hdma-, host#=0, fw=5.03.09 (95)
kernel: qla2xxx 0000:24:00.1: PCI INT B -> GSI 42 (level, low) -> IRQ 42
kernel: qla2xxx 0000:24:00.1: Found an ISP2532, irq 42, iobase 0xf071e000
kernel: qla2xxx 0000:24:00.1: Configuring PCI space...
kernel: qla2xxx 0000:24:00.1: Configure NVRAM parameters...
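
To compare hosts quickly, the relevant facts can be pulled out of a boot log like the one above with a few string checks. This is only a sketch in Python: it matches the exact message texts shown in this log, which may differ in other driver versions, and the function name is made up for illustration.

```python
import re

# "fw=5.03.09 (95)" in the ISP line carries the running firmware version.
FW_RE = re.compile(r"fw=(?P<version>\d+(?:\.\d+)+)")


def summarize_qla_boot(log_text):
    """Report whether the driver fell back to flash and which firmware it runs."""
    fw = FW_RE.search(log_text)
    return {
        "image_unavailable": "Firmware image unavailable" in log_text,
        "loaded_from_flash": "FW: Loading from flash" in log_text,
        "running_fw": fw.group("version") if fw else None,
    }
```

Feeding it the kernel log text (for example from a saved dmesg dump) gives one small dictionary per host, which makes it easier to see which hosts are not picking up the firmware file from the initrd.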

I have checked that the initrd does contain lib/firmware/ql2500_fw.bin, but the driver does not use it.
Does anyone know whether it is possible to put a newer firmware file to use?

The funny thing is that when I test during the day, for example with heavy IOPS testing, I don't get a single error, but at night at 06:00 (AM) the errors start.
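
Since the errors cluster at one time of night, counting them per hour can help line them up with scheduled jobs. The following is a rough sketch in Python, assuming the messages are in a syslog-style file with "Mon DD HH:MM:SS" timestamps at the start of each line. The regular expression is written against the "Cmnd-failed try alt ctrl" message format seen in this thread only, and the field names (array, ids, sense) are my own labels, not official MPP terms.

```python
import re
from collections import Counter

# Matches the MPP failover message as it appears in this thread.
MPP_RE = re.compile(
    r"\[RAIDarray\.mpp\](?P<array>[^:\s]+):(?P<ids>\d+:\d+:\d+)\s+"
    r"Cmnd-failed try alt ctrl (?P<alt>\d+)\.\s+"
    r"vcmnd SN (?P<sn>\d+)\s+"
    r"pdev (?P<pdev>H\d+:C\d+:T\d+:L\d+)\s+"
    r"(?P<sense>0x[0-9a-fA-F]+/0x[0-9a-fA-F]+/0x[0-9a-fA-F]+)"
)
# Classic syslog prefix, e.g. "Dec 19 04:39:55 host ..." (assumed log format).
SYSLOG_TIME_RE = re.compile(r"^\w{3}\s+\d+\s+(?P<hour>\d{2}):\d{2}:\d{2}")


def parse_mpp_line(line):
    """Return the message fields as a dict, or None if the line is not an MPP failure."""
    match = MPP_RE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    stamp = SYSLOG_TIME_RE.match(line)
    record["hour"] = stamp.group("hour") if stamp else None
    return record


def failures_per_hour(lines):
    """Count MPP failures per hour of day, skipping lines without a timestamp."""
    counts = Counter()
    for line in lines:
        record = parse_mpp_line(line)
        if record and record["hour"] is not None:
            counts[record["hour"]] += 1
    return counts
```

Running failures_per_hour() over the lines of the kernel log should show whether the failures really are confined to one hour, and which LUNs (the pdev field) are involved.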

Edited by: Pekka Panula on 15.2.2011 3:40



Gabor Zele Members

Gabor Zele
  • 13 posts

Posted 29 April 2011 - 01:48 AM

Hi Pekka!

I have the same issue. It seems that the 5.6 FP1 initrd is unable to load the firmware for the qla2xxx driver for some unknown reason. With 5.6.0 it worked fine.

I use older qla 2312 (2Gb) cards on LS21 blades. The only difference is that the driver is unable to load firmware at all (neither from disk nor from the onboard flash).

Actually the driver works if it is loaded after initrd processing, but this of course rules out booting from SAN. Multipathing is also problematic, as the default implementation tries to load MPP-RDAC and qla2xxx from the initrd too.

When I loaded the drivers manually after boot, I also saw error messages something like these, but I eventually rendered the install unbootable :) so I wasn't able to do further testing.

Are you booting off the SAN?

I have inspected the differences between 5.6 and 5.6 FP1, and there was a change in the included mkinitrd and nash versions, which I suspect is at fault for the firmware loading.



Pekka Panula Members

Pekka Panula
  • 109 posts

Posted 02 May 2011 - 08:14 AM

I managed to solve my problem by compiling the IBM version of the MPP driver, which is newer than the stock 5.6.0 FP1 driver. I did not manage to upgrade the firmware file, but that was not needed.

I do boot some servers from multipathed SAN; other servers have local disks, so multipathing is easier to get working on those.



veena edattale Members

veena edattale
  • 2 posts

Posted 19 December 2012 - 06:38 AM

Hi Pekka,

I currently have the same problem with exactly the same errors, and the storage is also the same. The RDAC version is 09.03.0C05.0638. Currently even I am clueless. We updated the QLogic drivers but we are still getting this error. We even replaced the SFPs and FC cables, but with no real luck. What did you do to resolve the problem?

Dec 19 04:39:55 chbs-bia1-03 chbs-bia1-03 kernel: [1270155.559527] 494 [RAIDarray.mpp]PROD_DS3524_DC165_SN13K0LCK:1:0:10 Cmnd-failed try alt ctrl 0. vcmnd SN 1870128 pdev H1:C0:T1:L10 0x05/0x94/0x01 0x08000002 mpp_status:1
Dec 19 04:39:55 chbs-bia1-07 chbs-bia1-07 kernel: [1265787.857565] 494 [RAIDarray.mpp]PROD_DS3524_DC165_SN13K0LCK:1:0:8 Cmnd-failed try alt ctrl 0. vcmnd SN 2262354 pdev H2:C0:T1:L8 0x05/0x94/0x01 0x08000002 mpp_status:1
Dec 19 04:39:56 chbs-bia1-08 chbs-bia1-08 kernel: [1270091.136487] 494 [RAIDarray.mpp]PROD_DS3524_DC165_SN13K0LCK:1:0:6 Cmnd-failed try alt ctrl 0. vcmn