
Issues with extreme slowness/freezing when PVS vDisk is in maintenance mode (write)

Started by Michael Vinter , 27 December 2016 - 12:49 PM
9 replies to this topic

Michael Vinter Members

Michael Vinter
  • 12 posts

Posted 27 December 2016 - 12:49 PM

Hi,

We just upgraded our PVS servers and infrastructure to v7.11 & XenServer 7, and since then we have severe issues with freezing and slowness when we put our dynamic vDisks into maintenance mode for upgrading. When we run the vDisks in production mode we see only minor issues, with 5-10 retry errors in the vDisk properties field (PVS agent). But when we try to write changes to the disk, it works for a couple of seconds and then freezes after anywhere from 30-500 retries. The machine is still pingable, but refuses to respond to commands.

The issues started when we upgraded our XenServer infrastructure from 6.5 to XenServer 7. At first we thought this was because PVS v7.6 FR3 was not fully compatible with XenServer 7. But now that we have upgraded our PVS servers from v7.6 FR3 -> v7.11 the issue still remains, so it seems to be at least partially due to XenServer 7.

 

Things we have tried together with Citrix support:

 

PVS v7.11.0.6 running on Windows Server 2012 R2 – exclusions for virus protection, DisableTaskOffload=1, firewalls off, etc.
PVS streamed server running on Windows Server 2012 R2 – DisableTaskOffload=1, BNIStack parameters changed back and forth to get rid of errors (not helping), we have tried with and without all Send/Receive offloads set on the NIC, IPv6 disabled, firewalls off, etc.
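For anyone searching later: the DisableTaskOffload tweak listed above is the registry value Citrix documents for PVS streaming. A sketch of setting it from an elevated prompt (reboot required afterwards):

```powershell
# Disable TCP task offload in the Windows TCP/IP stack (standard Tcpip key).
# Reboot the machine afterwards for the change to take effect.
reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v DisableTaskOffload /t REG_DWORD /d 1 /f
```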

 

After putting Hotfix XS70E13 on the XenServers we use for our development environment it doesn't freeze as often, but we still get hundreds of retries. Any ideas out there, or anyone sharing the same experience?

 

/ Michael



BOGDANST Members

Bogdan Stanciu
  • 37 posts

Posted 13 January 2017 - 09:33 AM

I had the same issue, and I had to lower the MTU of the PVS servers below the default of 1506. Retries dropped from 800 to 1-2.

After changing the MTU from PVS console > server properties, reboot all of the servers.

Do not perform any change if you have SQL offline support enabled. Disable it first, then make the modifications to PVS, restart the servers, and then enable it again.
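A quick way to check what frame size actually makes it end-to-end between a target device and the PVS server is a don't-fragment ping (the hostname here is a placeholder for your PVS server):

```powershell
# 1472 bytes of ICMP payload + 28 bytes of headers = a full 1500-byte packet.
# If this fails but a smaller -l value works, the path cannot carry full-size
# frames and the PVS MTU should be lowered to match.
ping -f -l 1472 pvs01.example.local
```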



Michael Vinter Members

Michael Vinter
  • 12 posts

Posted 13 January 2017 - 09:54 AM

Thanks Bogdan. Will try it and see if it helps.

 

I just sent some more information to Citrix support which is a bit weird, to say the least.

 

Now I have done some extensive testing and have finally come to a conclusion regarding the PVS hosts freezing. The reason is still unknown, though, and I hope someone has an idea why this previously worked with the same PVS streamed image on XenServer 6.5 and not on XenServer 7.

 

Scenario 1 (image in multi-device mode, "available for maintenance"):

 

Even after just booting up the Windows Server 2012 R2 OS it had 24 retries. I initiated a file copy and the image froze completely after 20-30 seconds, with only 271MB of data written to the disk.

 

Scenario 2 (image in private mode, "available for direct writes from one single VM"):

 

Booting up and logging on was much faster. No retries after bootup and I could successfully copy 2.7GB of data without a single retry.

 

Scenario 1: A high peak is visible in the VMware disk monitor, showing 10500ms write latency. This is when writing to the PVS image in "maintenance mode". It freezes after writing only 271MB to the disk.

 

Scenario 2: After that I switched to scenario 2, with the image in "Private image" mode, and did the exact same write test by copying a lot of large and small files. I could write all files to the system without even a single retry or ANY write peaks on the disk system (LUN).

 

Weird, to say the least...



BOGDANST Members

Bogdan Stanciu
  • 37 posts

Posted 13 January 2017 - 10:24 AM

Did Citrix ask you to uninstall the AV?

 

Anyway, the best way to exclude some components is to put the target device (in multi-device read-only mode) on the same XS host where PVS is running, then run a defrag and a virus scan at the same time on the target device.

Install CDF trace on PVS and enable debug on the target, as Citrix will ask for this: https://support.citrix.com/article/CTX138698

 

If you don't see more than one retry per minute, that is good. It would eliminate any issue with PVS, target device configuration and storage (depending on how it is configured when moved), and you can focus on the XS network, firmware, switches, etc.

 

I also had this: https://support.citrix.com/article/CTX200952, but I think they fixed it in 7.11.

 

Good luck!



Michael Vinter Members

Michael Vinter
  • 12 posts

Posted 13 January 2017 - 12:20 PM

Yes,

 

I even installed a brand-new 2012 R2 and used the PVS wizard to create a new image without any previous contamination from AV or any other applications that could interfere with the PVS communication. It works slightly better but still freezes from time to time.

 

I have set the MTU to 1500 now and unfortunately it's the same.

 

Citrix support could see excessive IRPs to the disk (see below for the outcome of a Windows memory dump check):

 

1. The dump shows 1781 queued IRPs to the disk. We can also see 4 outstanding bnistack requests.

2. From looking at the bnistack threads, these seem to be busy processing write transactions.

3. What we see in the dump is consistent with a storage-related issue.
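As a side note, the queued-IRP numbers above are the kind of thing you can inspect yourself if you open the memory dump in the Windows kernel debugger (the dump path here is hypothetical):

```powershell
# Open the memory dump in the command-line kernel debugger, then use
# !irpfind to list outstanding IRPs and !irp <addr> to inspect one of them.
kd -z C:\dumps\MEMORY.DMP
```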



BOGDANST Members

Bogdan Stanciu
  • 37 posts

Posted 13 January 2017 - 03:55 PM

From 1506 to 1500 is not much of a change.

The MTU I have configured is 1360. Try a lower one and run the test with defrag and scan, with the PVS server and target device on the same hypervisor host.

 

"dump is consistent with a storage related issue" - I believe you are using Cache in ram with overflow on HDD. How big is the ram and the disk attached? Did you try to move the disk attached to a different storage?



Michael Vinter Members

Michael Vinter
  • 12 posts

Posted 16 January 2017 - 08:38 AM

Tested with MTU 1360, but the error is persistent and still causes lockups. What I did, though, was copy the same image to my local SSD drive and publish it to the PVS server via a UNC path instead of the local block storage provided by our VMware environment (local disk on the PVS server). I also tried a remote UNC path on a slow NAS, and strangely enough both WORK.

 

So the error seems to be related to having PVS 7.11 -> XenServer 7.0 -> PVS server hosted in VMware 5.5 with local block storage.

 

But two things are weird:

 

1. We never had these issues until we upgraded XenServer from 6.5 -> 7.0.

2. This issue ONLY occurs with the disk in multi-device mode and not in private mode when writing changes to the image file (regardless of whether it's static or dynamic).



BOGDANST Members

Bogdan Stanciu
  • 37 posts

Posted 16 January 2017 - 12:29 PM

Well... at least you know it's the storage.

What type of storage do you have? Look at driver and firmware updates. Run a server status report from XenCenter and upload the collected files to TaaS to analyse. Maybe you missed updating a driver, etc.

 

If not, move the vDisk to XS local storage with multiple PVS servers load balanced.

 

When the vDisk is in multi-device mode, all the writes go to the PVS server where the vDisk is located. When it is read-only, the writes go to RAM (if you have cache in RAM with overflow to HDD) and then to HDD when the RAM used for PVS fills up.



Michael Vinter Members

Michael Vinter
  • 12 posts

Posted 16 January 2017 - 12:38 PM

It's a Hitachi HUS-150 with FC-AL. It's connected via 2x10Gbit LAN/iSCSI from our HP C7000 blade chassis infrastructure, so it's rather a complex setup with both physical and virtual switches (HP Virtual Connect infrastructure).

 

As I wrote regarding writing: it can be done in two ways (reading is not an issue), either in multi-device mode (snapshot based, with the image in maintenance mode) or in direct write mode (private mode). It's only in traditional multi-device (maintenance) mode that this weird write overload occurs.

 

The next step will be to move the PVS servers away from our VMware infrastructure to a couple of load-balanced servers outside the blade chassis infrastructure. But it would be nice to know why this issue never existed before upgrading to XenServer 7. Seems like a SCSI lockup issue.



Michael Vinter Members

Michael Vinter
  • 12 posts

Posted 30 January 2017 - 03:26 PM

Follow up:

 

We have now solved the issues with freezing when writing to a PVS image in maintenance mode. The issue occurred due to several factors:

 

  • The PVS disk image was in thin-provisioned mode
  • The VMware LUN hosting the PVS images was also thin provisioned
  • The VMware environment got SCSI lockups because the same SCSI adapter was used for both LUNs. Fixed by assigning a dedicated SCSI adapter to the LUN hosting the PVS images

Everything is back to normal now!
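For reference, the dedicated-adapter fix can be sketched in VMware PowerCLI (the vCenter address, VM name, and disk name here are hypothetical):

```powershell
# Attach the disk that backs the PVS images to its own SCSI controller,
# so it no longer shares a controller with the OS disk.
Connect-VIServer vcenter.example.local
$disk = Get-HardDisk -VM "PVS01" -Name "Hard disk 2"
New-ScsiController -HardDisk $disk -Type ParaVirtual
```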