
XenServer 7 and Dell Compellent issues (SC4020)

Started by Calinescu Dragos, 06 February 2017 - 03:02 PM
19 replies to this topic

Calinescu Dragos Members
  • 13 posts

Posted 06 February 2017 - 03:02 PM

Dear all,

 

We currently run a blade chassis, a Dell M1000 with two Dell M630 blades, connected via two top-of-rack N4032F switches. The whole setup is 10Gb, all Dell-certified, interconnected via 10Gb twinax cables, and fully redundant. The two XenServer 7 hosts are joined into a pool, with all updates applied!

 

The initial setup comprised a Dell SCv2020 storage array and XenServer 7 as Dom0, connected via 2 x 10Gb Ethernet, one for each controller.

 

A further upgrade was the acquisition of two SC4020 arrays set up in a Live Volume configuration, linked to the same setup via 10Gb twinax cables into the same switches.

 

With the Compellent SCv2020 everything was a breeze. VMs were running fine with decent performance (300-350 Mbps in disk writes), very fast (~10 second) start-up and shutdown of machines, and migrate/copy/transfer speeds of ~100-120 Gb per hour.

 

Problems arose as soon as the 2xSC4020 cluster was set up. All VMs linked to this storage take around 4-6 minutes to start up or shut down, with no exception related to VM OS type or disk size. A VM simply freezes after the first 5 seconds and only finishes starting up at the 4-6 minute mark.

 

Any kind of maintenance on those 2xSC4020 iSCSI SRs is a real horror. Besides the slow VM shutdown and start times, any kind of migration within the pool runs at 1/3 of the SCv2020 speed, no matter whether it is a simple disk copy on the same storage or between storages. Migrating a live machine is a heart attack, and mapping the iSCSI SRs after a restart takes 10 minutes.

 

Things seem to be related to the number of iSCSI connections. On the SCv2020 we had 2 iSCSI connections, and on the 2xSC4020 cluster we obviously have 4. If we only connect 2 of those 4 (for instance, when we use a regular volume instead of a Live Volume on this cluster, hence only 2 iSCSI connections), the boot-up and shutdown times are roughly halved, to about 2 minutes.

 

Further investigation with Dell over the last few months has left everything updated (XenServer Dom0, switches, and controllers) to the levels their compatibility matrix says are fine. All double- and triple-checked against their deployment guide.

Multipath is enabled, showing 4 of 4 paths active (8 iSCSI sessions) for a regular Live Volume. No errors in multipath -ll.

Once started, VMs work at a decent speed (250-300 Mbps), but clearly slower than on the SCv2020.

 

Here are the logs of a VM start-up linked to a Live Volume on 4 IPs.

 

In SMlog, the first 4 lines are from those few seconds before it freezes:

 

Feb  6 16:45:25 static-10-0-0-34 SM: [24092] lock: opening lock file /var/lock/sm/iscsiadm/running
Feb  6 16:45:25 static-10-0-0-34 SM: [24092] lock: acquired /var/lock/sm/iscsiadm/running
Feb  6 16:45:25 static-10-0-0-34 SM: [24092] lock: released /var/lock/sm/iscsiadm/running
Feb  6 16:45:25 static-10-0-0-34 SM: [24092] lock: closed /var/lock/sm/iscsiadm/running

 

 

Then it is frozen for 20 seconds and the first errors occur:

 

Feb  6 16:46:10 static-10-0-0-34 SM: [24092] ['/usr/lib/udev/scsi_id', '-g', '-s', '/block/sdf']
Feb  6 16:46:10 static-10-0-0-34 SM: [24092] FAILED in util.pread: (rc 1) stdout: '', stderr: '/usr/lib/udev/scsi_id: invalid option -- 's'
Feb  6 16:46:10 static-10-0-0-34 SM: [24092] '
Feb  6 16:46:10 static-10-0-0-34 SM: [24092] getSCSIid failed on sdf in iscsi /dev/iscsi/iqn.2002-03.com.compellent:5000d31000e67222/10.10.2.6:3260: LUN offline or iscsi path down
Feb  6 16:46:10 static-10-0-0-34 SM: [24092] path /dev/iscsi/iqn.2002-03.com.compellent:5000d31000e6701f/10.10.2.9:3260
Feb  6 16:46:10 static-10-0-0-34 SM: [24092] iscsci data: targetIQN iqn.2002-03.com.compellent:5000d31000e6701f, portal 10.10.2.9
Feb  6 16:46:10 static-10-0-0-34 SM: [24092] lock: opening lock file /var/lock/sm/iscsiadm/running
Feb  6 16:46:10 static-10-0-0-34 SM: [24092] lock: acquired /var/lock/sm/iscsiadm/running
Feb  6 16:46:10 static-10-0-0-34 SM: [24092] lock: released /var/lock/sm/iscsiadm/running
Feb  6 16:46:10 static-10-0-0-34 SM: [24092] lock: closed /var/lock/sm/iscsiadm/running

 

The process is repeated for all 4 iSCSI IPs of the LUN, then it throws a bunch of errors:
 
Feb  6 16:49:16 static-10-0-0-34 SM: [25582] FAILED in util.pread: (rc 1) stdout: '', stderr: '/usr/lib/udev/scsi_id: invalid option -- 's'
Feb  6 16:49:16 static-10-0-0-34 SM: [25582] '
Feb  6 16:49:16 static-10-0-0-34 SM: [25582] getSCSIid failed on sdd in iscsi /dev/iscsi/iqn.2002-03.com.compellent:5000d31000e67222/10.10.2.6:3260: LUN offline or iscsi path down
Feb  6 16:49:16 static-10-0-0-34 SM: [25582] ['/usr/lib/udev/scsi_id', '-g', '--device', 'sde']
Feb  6 16:49:16 static-10-0-0-34 SM: [25582] FAILED in util.pread: (rc 1) stdout: '', stderr: ''
Feb  6 16:49:16 static-10-0-0-34 SM: [25582] ['/usr/lib/udev/scsi_id', '-g', '-s', '/block/sde']
Feb  6 16:49:16 static-10-0-0-34 SM: [25582] FAILED in util.pread: (rc 1) stdout: '', stderr: '/usr/lib/udev/scsi_id: invalid option -- 's'
Feb  6 16:49:16 static-10-0-0-34 SM: [25582] '
Feb  6 16:49:16 static-10-0-0-34 SM: [25582] getSCSIid failed on sde in iscsi /dev/iscsi/iqn.2002-03.com.compellent:5000d31000e67222/10.10.2.6:3260: LUN offline or iscsi path down
Feb  6 16:49:16 static-10-0-0-34 SM: [25582] ['/usr/lib/udev/scsi_id', '-g', '--device', 'sdf']

 

 
And finally, after 4-6 minutes, the VM starts:
 
Feb  6 16:51:35 static-10-0-0-34 SM: [25582] tap.activate: Launched Tapdisk(vhd:/dev/VG_XenStorage-cc0a18e1-e75f-37e3-308d-c4022ea7226d/VHD-fd2f588d-8370-471a-ad02-2157b1369fff, pid=27129, minor=2, state=R)
Feb  6 16:51:35 static-10-0-0-34 SM: [25582] Attempt to register tapdisk with RRDD as a plugin.
Feb  6 16:51:35 static-10-0-0-34 SM: [25582] ERROR: Failed to register tapdisk with RRDD due to UnixStreamHTTP instance has no attribute 'getresponse'
Feb  6 16:51:35 static-10-0-0-34 SM: [25582] lock: released /var/lock/sm/lvm-cc0a18e1-e75f-37e3-308d-c4022ea7226d/lvchange-p
Feb  6 16:51:35 static-10-0-0-34 SM: [25582] DeviceNode(/dev/sm/backend/cc0a18e1-e75f-37e3-308d-c4022ea7226d/fd2f588d-8370-471a-ad02-2157b1369fff) -> /dev/xen/blktap-2/tapdev2
Feb  6 16:51:35 static-10-0-0-34 SM: [25582] lock: closed /var/lock/sm/lvm-cc0a18e1-e75f-37e3-308d-c4022ea7226d/lvchange-p
Feb  6 16:51:35 static-10-0-0-34 SM: [25582] lock: released /var/lock/sm/fd2f588d-8370-471a-ad02-2157b1369fff/vdi
Feb  6 16:51:35 static-10-0-0-34 SM: [25582] lock: closed /var/lock/sm/cc0a18e1-e75f-37e3-308d-c4022ea7226d/sr
Feb  6 16:51:35 static-10-0-0-34 SM: [25582] lock: closed /var/lock/sm/fd2f588d-8370-471a-ad02-2157b1369fff/vdi
Feb  6 16:51:35 static-10-0-0-34 SM: [25582] lock: closed /var/lock/sm/cc0a18e1-e75f-37e3-308d-c4022ea7226d/sr

 

Any idea what may cause these strange things? I mean, the same blade servers (and regular servers) are mapped both to the older SCv2020 and to the new 2xSC4020 cluster. The only difference is the SRs mapped; there are no other changes, as it is the same XenServer installation.

 

Thank you, gentlemen!


Damir Zagar Members
  • 15 posts

Posted 13 February 2017 - 10:33 AM

Please check my response in the thread 'Scan target fails on XenServer 7 and Compellent SC9000'. Maybe SC4020 has the same issue as SC9000.

 

Which version of SCOS are you running on SC4020?



Calinescu Dragos Members
  • 13 posts

Posted 13 February 2017 - 11:13 AM

Dear Damir,

 

We're running SCOS 7.1.4.4 on the 4020s, as recommended and uploaded by Dell. Until the 27th of February XenServer 7 was not officially supported; since then it is supported on both the Dell and Citrix pages with SCOS 7.1.x.x.

 

The strange thing is that XenServer 6.5 in the same setup works flawlessly: any startup/shutdown takes place in less than 10 seconds. Once XenServer 7 is installed, everything is turned upside down and we have these problems.

 

I read somewhere here on the forum that someone adjusted iscsi.py to get rid of some similar issues. I'll take a look to see whether that stderr '/usr/lib/udev/scsi_id: invalid option -- 's'' has any reference to this file, and whether there is any kind of adjustment that can get rid of this incredible issue.

 

Many thanks, Damir.



Damir Zagar Members
  • 15 posts

Posted 13 February 2017 - 07:44 PM

We are using the same 7.1.4.4 on our SC9000.

 

Just for a test, try to configure the SC4020 with a single path; I think it will probably work OK.

 

I've noticed the same messages with the SC9000 when it is configured to use multipath (MPIO), in our case with two separate networks (VLANs) and a virtual configuration, so I think they are related to the multipath configuration, or rather to how Storage Center is handling it.

 

What multipath.conf are you using on XenServer 7? There is a default 'COMPELNT' device section in multipath.conf which is used if you don't supply a custom one (check multipathd -k"show conf"), and which probably has some issues.
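
For reference, a custom device section in /etc/multipath.conf would look roughly like this. This is only an illustrative sketch; the attribute values here are assumptions and should be checked against Dell's deployment guide:

devices {
    device {
        # Vendor/product strings as reported by Compellent arrays
        vendor                "COMPELNT"
        product               "Compellent Vol"
        # Group all paths into one group and spread I/O across them
        path_grouping_policy  multibus
        # Probe path health with TEST UNIT READY
        path_checker          tur
        # Queue I/O for a while instead of failing when all paths drop
        no_path_retry         24
        failback              immediate
    }
}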



Calinescu Dragos Members
  • 13 posts

Posted 14 February 2017 - 08:46 AM

Damir

 

With 2 paths the restart/shutdown time is halved. I did not try with just one path, but I will, and I'm sure the speed will be fine.

 

Still, we need those 4 paths, as we have the SC4020 cluster and Live Volumes.

 

I'm using the default multipath.conf in XenServer 7. Through my intensive internet searches I could not find one better suited than the original. I made some modifications at some point, but they were either unimportant or made things worse.

 

As far as I understand, your SC9000 is in more or less the same functional state? What have you done?

 

What is frustrating is that Xen 6.5 works fine, while for the last 2 months we have been stuck with a ~100k solution limited in functionality. Dell passed everything to Citrix, even though they say they are compatible...



Damir Zagar Members
  • 15 posts

Posted 14 February 2017 - 09:44 AM

SCOS is probably the same across all SC platforms, and we are currently stuck in the same situation.

 

I've tried to escalate our problem with CoPilot and am waiting for a resolution.

 

We are probably not alone in this, and I hope that someone will offer a solution soon.



Calinescu Dragos Members
  • 13 posts

Posted 14 February 2017 - 10:11 AM

I believe that SCOS is indeed the same across the entire platform.

 

I was unable to solve anything with Dell, as they requested a valid XenServer license and we're using the free version :(. They keep harping on networking issues, transfer latency, Jumbo Frames and so on, even though the same networking setup works perfectly in the SC4020+Xen 6.5 and SCv2020+Xen 7 combinations. To me this is certainly a XenServer 7 bug, as you can see from the logs attached.

 

Or we need a special multipath.conf for the SC series.

 

Looking forward to seeing whether CoPilot solves your problem.



Damir Zagar Members
  • 15 posts

Posted 14 February 2017 - 11:41 AM

Probably a XS bug ... in the handling of the 'scsi_id' execution in /opt/xensource/sm/scsiutil.py, in getSCSIid(path).

 

/usr/lib/udev/scsi_id is invoked with a wrong option, so I made the following change:

 

        #stdout = util.pread2([SCSI_ID_BIN, '-g', '-s', '/block/%s' % dev])
        stdout = util.pread2([SCSI_ID_BIN, '-g', '/dev/%s' % dev])
Now it seems to be working, but we need to verify...
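
For context, this is roughly how getSCSIid() in /opt/xensource/sm/scsiutil.py ends up after the edit. This is an illustrative sketch only; the exact surrounding code varies between XenServer builds:

import os
import util  # XenServer's bundled sm utility module

SCSI_ID_BIN = '/usr/lib/udev/scsi_id'

def getSCSIid(path):
    # 'path' is a device node such as /dev/sdf (or a /dev/iscsi/... symlink)
    dev = os.path.basename(os.path.realpath(path))
    # Old call used the '-s /block/<dev>' syntax, which newer udev removed:
    #   stdout = util.pread2([SCSI_ID_BIN, '-g', '-s', '/block/%s' % dev])
    # New call passes the device node directly:
    stdout = util.pread2([SCSI_ID_BIN, '-g', '/dev/%s' % dev])
    return stdout.strip()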


Calinescu Dragos Members
  • 13 posts

Posted 14 February 2017 - 12:23 PM

Jesus !!!

 

It's working!!! 10 seconds for startup and shutdown. SMlog is error-free. Almost.

 

I tested it only on my test machine, but I will replicate the setting to the live machines and let all of you know!

 

Damir, where should I send a pizza? :D And a hug?

 

Geee !!!!!  :P



Calinescu Dragos Members
  • #10
  • 13 posts

Posted 14 February 2017 - 12:33 PM

The only errors in SMlog are the ones below. But it's working fine. Should we worry?
 
 
Feb 14 14:31:34 static-10-0-0-34 SM: [29773] FAILED in util.pread: (rc 1) stdout: '', stderr: ''
Feb 14 14:31:34 static-10-0-0-34 SM: [29773] ['/usr/lib/udev/scsi_id', '-g', '/dev/sde']
Feb 14 14:31:34 static-10-0-0-34 SM: [29773]   pread SUCCESS
Feb 14 14:31:34 static-10-0-0-34 SM: [29773] dev from lun sde 36000d31000e672000000000000000051
Feb 14 14:31:34 static-10-0-0-34 SM: [29773] ['/usr/lib/udev/scsi_id', '-g', '--device', 'sdf']
Feb 14 14:31:34 static-10-0-0-34 SM: [29773] FAILED in util.pread: (rc 1) stdout: '', stderr: ''
Feb 14 14:31:34 static-10-0-0-34 SM: [29773] ['/usr/lib/udev/scsi_id', '-g', '/dev/sdf']
Feb 14 14:31:34 static-10-0-0-34 SM: [29773]   pread SUCCESS


Damir Zagar Members
  • #11
  • 15 posts

Posted 14 February 2017 - 12:33 PM

Croatia is a bit far away for pizza delivery, but I will accept a call for a beer once I'm in your neighbourhood (Romania?) :)

 

Hope there will be an XS bugfix soon, so we don't have to take care of it ourselves in the future...



Calinescu Dragos Members
  • #12
  • 13 posts

Posted 14 February 2017 - 12:42 PM

Well, it is not so far from Bucharest :) Anytime for a beer, no matter who's first in the other's area :)

 

Is there any way we can send this to the XenServer developers to look at, and maybe include it in a future update?



Damir Zagar Members
  • #13
  • 15 posts

Posted 14 February 2017 - 01:35 PM

I've sent a bug report to the xen-api list.

 

I think you can ignore the errors related to the execution of the first ('/usr/lib/udev/scsi_id', '-g', '--device', 'sdf') command line.

 

Anyway... we (Compellent users) need a proper fix, and I hope we are going to get one :)



Jakub Kramarz Members
  • #14
  • 5 posts

Posted 14 February 2017 - 10:31 PM

It doesn't work for me; this fix fails when creating an SR on XenServer 7.0.0 with an SCv2000.

https://github.com/xapi-project/sm/issues/345



Calinescu Dragos Members
  • #15
  • 13 posts

Posted 15 February 2017 - 10:38 AM

Jakub,

 

I never had problems with XenServer 7 and the SCv2020. What is your problem? Could you give details?

 

For us this is working just fine. I deleted the SR and joined it again, and it's working just fine.

 

Thank you



Jakub Kramarz Members
  • #16
  • 5 posts

Posted 15 February 2017 - 03:13 PM

Which operating system did you select when creating the server cluster in the storage configuration?

 

I have 'XenServer 7.x MPIO' selected and I'm running the latest SCOS (7.1.2.7).

Every single SR operation fails when I select '*' as the Target IQN, but works correctly when I select a specific iSCSI portal (and abandon multipath).

 

From the XenServer perspective, it looks just like tgtd configured as in the issue I've linked above.
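
For comparison, the xe equivalent of that wizard step looks roughly like this; the placeholders are hypothetical and must come from your own sr-probe output:

# Create the SR against a wildcard IQN (the multipath case that fails for me)
xe sr-create name-label="Compellent iSCSI" type=lvmoiscsi shared=true \
   device-config:target=<portal IP> device-config:targetIQN=* \
   device-config:SCSIid=<SCSIid from sr-probe>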



Calinescu Dragos Members
  • #17
  • 13 posts

Posted 15 February 2017 - 03:37 PM

XenServer 7.x MPIO - but I have an older SCOS: 6.6.11.9. And an SCv2020.

 

I believe you have enabled multipath in XenServer? I know that sounds stupid. What does multipath -ll say?



Jakub Kramarz Members
  • #18
  • 5 posts

Posted 16 February 2017 - 11:19 AM

Despite having enabled multipathing, I have at most 1 iSCSI session at a time, so multipath refuses to consider the resource multipath-capable and the whole operation fails. I can't find a way of downgrading to 6.6.11, so I can't check whether this is "a new feature".



Calinescu Dragos Members
  • #19
  • 13 posts

Posted 16 February 2017 - 11:28 AM

Have you tried to connect the iSCSI paths manually? First, discover them:

 

iscsiadm -m discovery -t st -p 192.168.x.x

 

And then connect to all of them:

 

For example: iscsiadm -m node -l

 

 

If you prefer to log in to an individual iSCSI target, the following command can be issued:

#iscsiadm -m node -T <Complete Target Name> -l -p <Group IP>:3260

Example:

#iscsiadm -m node -l -T iqn.2001-05.com.equallogic:83bcb3401-16e0002fd0a46f3d-rhel5-test -p 172.23.10.240:3260
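
Afterwards, a quick way to check what actually logged in (standard open-iscsi and multipath commands):

# List active iSCSI sessions with basic detail
iscsiadm -m session -P 1
# Re-check that multipath now sees all paths
multipath -ll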



Jakub Kramarz Members
  • #20

Jakub Kramarz
  • 5 posts

Posted 16 February 2017 - 10:15 PM

Yes, if I connect the iSCSI paths manually during SR operations, everything works fine.