Load Balancing methods do not work as expected between WEB and APP servers.

Hakan Polatli1709158891 · April 3, 2019

Hi Guys,

I have a customer complaining that the least connection load balancing method does not work as it is supposed to. When looking into the vServer configuration, I noticed that the current method is shown as Round Robin, with the reason "Bound service's state changed to UP". My first thought was Slow-Start mode, by which NetScaler protects newly enabled or newly UP servers from a possible overload. However, I concluded that it is not the root cause, because the servers' state does not change often enough for Slow-Start mode to switch the load balancing method from the configured one to Round Robin.

Regarding Slow-Start Mode:

https://support.citrix.com/article/CTX108886?_ga=2.163941951.1802433873.1553506206-1079685509.1515480303

Anyway, although the least connection method is configured, we observe that some servers have 11 connections whereas others have 3, so there is always a large gap between the server with the most connections and the server with the fewest. The vServer in question is SSL_BRIDGE. Connection multiplexing is not in use, which is obvious when comparing "Current Client Connections" and "Current Server Connections" in the vServer statistics: these values are identical for each service bound to the vServer.
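For reference, the rule least connection is expected to apply can be sketched as below. This is a simplified model, not NetScaler's implementation: `Service`, `pick_least_connection` and `assign` are hypothetical names, and weights, slow start and connection closes are ignored.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical model of a bound service and its current connection count."""
    name: str
    active_connections: int


def pick_least_connection(services: list[Service]) -> Service:
    """Return the service with the fewest active connections.

    Ties go to the earliest service in the list (min() keeps the first minimum).
    """
    return min(services, key=lambda s: s.active_connections)


def assign(services: list[Service], new_connections: int) -> None:
    """Send each new connection to the least-loaded service (nothing closes)."""
    for _ in range(new_connections):
        pick_least_connection(services).active_connections += 1


# Hypothetical starting point: 11, 3 and 5 active connections.
services = [Service("svc1", 11), Service("svc2", 3), Service("svc3", 5)]
assign(services, 8)
print([s.active_connections for s in services])  # [11, 8, 8]
```

In this model the busiest service keeps its 11 connections because nothing closes; only new connections are steered toward the least-loaded services.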

I attached the vServer and servicegroup configuration, the connection counts at the vServer, and two graphs taken from the APM tool: one shows the number of connection requests to the servers and the other shows the number of active connections on the servers. The NetScaler version is 11.1 Build 58.13.

The customer changed the load balancing method to Round Robin and they are still not satisfied. Is it normal to see this, since some connections may last longer than others, or is there anything I am missing? This is traffic between WEB servers and APP servers, which is something I am not very familiar with.

Any help or idea would be great, as I really don't know what to do. Load balancing is the main task this appliance performs, so I think this shouldn't be a problem :)

Thanks,

Hakan

lb-method.rar

Mihai Cziraki1709160741 · April 4, 2019

Hi!

So the last time the vServer went into Round Robin mode was 2 days ago. Service 1, MOBAPP02, is the one with the lowest uptime.

That article says it will leave the round robin method only when it reaches:

request rate x number of packet engines x bound services

I don't know what your request rate is/was or how many packet engines you have.

So if, say, you had 100 requests/s x 7 packet engines x 13 services = 9100,

then only after 9100 client hits (TCP connections in your case) will it leave the round robin method.

I am guessing your vServer did not reach that number, so it stays with round robin.
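The formula above can be written out as a small calculation. A minimal sketch: the 100 requests/s rate is the assumed example value from this post, not a measured one, and the function name is hypothetical.

```python
def startup_rr_hits(request_rate: float, packet_engines: int, bound_services: int) -> float:
    """Client hits served in startup round robin, using the formula quoted above:
    request rate x number of packet engines x bound services."""
    return request_rate * packet_engines * bound_services


# Assumed example: 100 requests/s, 7 packet engines, 13 bound services.
print(startup_rr_hits(100, 7, 13))  # 9100
```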

I found this article that might help you:

https://docs.citrix.com/en-us/netscaler/12/load-balancing/load-balancing-advanced-settings/slow-start-service.html

"With automated slow start, a service is taken out of the slow start phase when one of the following conditions applies:

The actual request rate is less than the new service request rate.

The service does not receive traffic for three successive increment intervals.

The request rate has been incremented 200 times.

The percentage of traffic that the new service must receive is greater than or equal to 100.

"

Give this a try.
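The four exit conditions quoted above amount to a single OR check. A sketch only: the parameter names are hypothetical, and the appliance tracks these values internally rather than exposing them like this.

```python
def exits_automated_slow_start(
    actual_rate: float,
    new_service_rate: float,
    idle_intervals: int,
    increments: int,
    traffic_percent: float,
) -> bool:
    """True if any of the four documented exit conditions holds."""
    return (
        actual_rate < new_service_rate   # actual rate below the new service request rate
        or idle_intervals >= 3            # no traffic for three successive increment intervals
        or increments >= 200              # request rate already incremented 200 times
        or traffic_percent >= 100         # new service must receive >= 100% of the traffic
    )


print(exits_automated_slow_start(50, 20, 0, 10, 40))  # False
```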

Hakan Polatli1709158891 · April 7, 2019

Hi Mihai,

Thanks for the comment, which encouraged me to take a deep dive into this. I have read the command reference as well as the link you sent.

When looking into the command reference:

https://developer-docs.citrix.com/projects/netscaler-command-reference/en/11.0/lb/lb-vserver/lb-vserver/

newServiceRequestIncrementInterval

Interval, in seconds, between successive increments in the load on a new service or a service whose state has just changed from DOWN to UP. A value of 0 (zero) specifies manual slow start. Default value: 0 Minimum value: 0 Maximum value: 3600

- My configuration is set to 0, so manual slow start is in use.

newServiceRequest

Number of requests, or percentage of the load on existing services, by which to increase the load on a new service at each interval in slow-start mode. A non-zero value indicates that slow-start is applicable. A zero value indicates that the global RR startup parameter is applied. Changing the value to zero will cause services currently in slow start to take the full traffic as determined by the LB method. Subsequently, any new services added will use the global RR factor. Default value: 0 Minimum value: 0

- My configuration is set to 0, so the global RR startup parameter is in use.
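My reading of the two parameter descriptions above, written as a small decision function. This is an interpretation of the command reference text, not Citrix code; the function name and return strings are hypothetical.

```python
def slow_start_mode(new_service_request: int, increment_interval: int) -> str:
    """Classify slow start behaviour from the newServiceRequest and
    newServiceRequestIncrementInterval values (interpretation of the quoted text)."""
    if new_service_request == 0:
        # Zero: the global RR startup parameter is applied instead of slow start.
        return "global startup RR"
    if increment_interval == 0:
        # Non-zero request value with a zero interval: manual slow start.
        return "manual slow start"
    # Both non-zero: the load on the new service is increased every interval.
    return "automated slow start"


print(slow_start_mode(0, 0))  # global startup RR (my configuration)
```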

I have 6 cores and 13 services bound to this vServer. I had also suspected that my vServer stays in slow start for a long time, but I thought that for this to happen, the services' state would have to change frequently. So when I noticed that the most recent state change among the servers was 2 days ago, I concluded that this was not the case. However, as you said, I may need to look at the vServer hits.

So my vServer exits slow start mode when it reaches:

100 requests/s x 6 packet engines x 13 services = 7800 hits.

That is, from the time MOBAPP02's state changed to UP until the time this servicegroup configuration was taken, my vServer might not have received 7800 hits. So I will check the vServer hits counter between Fri Mar 29 15:30:25 and Mon Apr 01 12:23:37 (Mar 29 15:30:25 + 2 days, 20:53:12):

1) 1.1.1.1:443 State: UP Server Name: MOBAPP02 Server ID: None Weight: 1
Last state change was at Fri Mar 29 15:30:25 2019
Time since last state change: 2 days, 20:53:12.550
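The end of that window is the last state change plus the "Time since last state change" value. A quick check with Python's datetime, dropping the fractional seconds:

```python
from datetime import datetime, timedelta

# Last state change of MOBAPP02, from the service output above.
last_change = datetime(2019, 3, 29, 15, 30, 25)
# "Time since last state change: 2 days, 20:53:12.550" (fraction dropped).
elapsed = timedelta(days=2, hours=20, minutes=53, seconds=12)

snapshot = last_change + elapsed
print(snapshot.strftime("%a %b %d %H:%M:%S %Y"))  # Mon Apr 01 12:23:37 2019
```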

nsconmsg -K /var/nslog/newnslog.xx -d current -g vsvr_tot_Hits -T 7 | grep vserver_ip
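Once the counter values are pulled out of newnslog, finding how long the vServer needs to collect N hits is a simple scan. A sketch, assuming the (timestamp, totalcount-val) pairs have already been extracted from the nsconmsg output by some other means; `time_to_reach` is a hypothetical helper and its input format is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional


def time_to_reach(
    samples: list[tuple[datetime, int]], start: datetime, hits_needed: int
) -> Optional[timedelta]:
    """Return how long after `start` the vsvr_tot_Hits counter has grown by at
    least `hits_needed`, or None if the samples never get that far.

    `samples` are (timestamp, totalcount-val) pairs for a single vServer, sorted
    by time (hypothetical input, extracted from nsconmsg output beforehand).
    """
    baseline = None
    for ts, total in samples:
        if ts < start:
            continue              # ignore samples taken before the window opens
        if baseline is None:
            baseline = total      # counter value at the first sample inside the window
        if total - baseline >= hits_needed:
            return ts - start
    return None
```

For example, calling it with `start=datetime(2019, 3, 29, 15, 30, 25)` and `hits_needed=7800` would give the time needed for the 7800 hits discussed above.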

Alternatively, do you know a direct way to see how long this vServer had been using Round Robin during these 2 days and 21 hours?

Thanks,

Hakan

Hakan Polatli1709158891 · April 8, 2019

Update:

By analyzing newnslog, I found that it takes about 21 hours for the vServer to get 7800 hits, but this was useless because there was something I had missed: the default value for startupRRFactor given in the link below applies to a newly configured vServer.

https://support.citrix.com/article/CTX108886?_ga=2.243877189.1561745537.1554466477-1079685509.1515480303

By default the newly configured virtual server remains in a Slow Start mode for Startup RR Factor of 100.

When I looked at the command reference to find a description of the "set lb parameter startupRRFactor" command, I found this:

https://developer-docs.citrix.com/projects/netscaler-command-reference/en/12.0/lb/lb-parameter/lb-parameter/

For an existing virtual server, if one or more services are newly bound or newly enabled, or if the load balancing method is changed, the appliance dynamically computes the number of requests for which to implement startup round robin. It obtains this number by multiplying the request rate by the number of bound services (it includes services that are marked as DOWN). For example, if the current request rate is 20 requests/s and ten services are bound to the virtual server, the appliance performs startup round robin for 200 requests.
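The command reference formula as a one-liner, checked against its own example (20 requests/s x 10 services = 200). Unlike the formula used earlier in this thread, it has no packet-engine factor; the function name is hypothetical.

```python
def startup_rr_requests(request_rate: float, bound_services: int) -> float:
    """Startup round robin length per the lb parameter command reference:
    current request rate x number of bound services (DOWN services included)."""
    return request_rate * bound_services


print(startup_rr_requests(20, 10))  # 200
```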

Most of the time, the request rate of the vServer in question is between 0 and 1. It usually doesn't get 7 requests in any 7-second period, and when the delta value is less than 7, the request rate is shown as 0. So in this case, for this vServer to exit slow start, it must receive the following number of requests:

1 (req rate) x 6 (cores) x 13 (services bound to the vServer) = 78 hits.

If it uses fractional rates in the calculation, possible values are:

(2/7) x (6) x (13) ≈ 22 hits.

(19/7) x (6) x (13) ≈ 212 hits

Assuming the worst case has occurred, that is, a service's state changed to UP when the request rate was 2 (delta = 19), then even if the vServer receives only 1 request every 7 seconds from the moment of the state change, it would exit slow start after 1484 seconds ((212/1) x 7), which is approximately 25 minutes. That is nowhere near as long as we predicted; it's not even close.
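The arithmetic above, reproduced as a sketch. It assumes, as above, that the rate is delta / 7 and that the formula includes the 6 packet engines; the constant and function names are hypothetical.

```python
SAMPLE_INTERVAL_S = 7   # nsconmsg rate/sec is based on 7-second samples (per this thread)
PACKET_ENGINES = 6
BOUND_SERVICES = 13


def hits_to_exit(delta: int) -> float:
    """Startup RR hits if the rate is taken as delta / 7 without truncation."""
    return (delta / SAMPLE_INTERVAL_S) * PACKET_ENGINES * BOUND_SERVICES


def worst_case_exit_seconds(hits_needed: float, hits_per_sample: int = 1) -> float:
    """Seconds to collect `hits_needed` hits if only `hits_per_sample` arrive per 7 s."""
    return round(hits_needed) / hits_per_sample * SAMPLE_INTERVAL_S


for delta in (2, 19):
    print(delta, round(hits_to_exit(delta)))   # prints "2 22", then "19 212"

seconds = worst_case_exit_seconds(hits_to_exit(19))
print(f"{seconds:.0f} s = {seconds / 60:.1f} min")  # 1484 s = 24.7 min
```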

Log analysis sample:

nsconmsg -K /var/nslog/newnslog.xx.tar.gz -d current -g vsvr_tot_Hits

reltime:mili second between two records Fri Apr 5 15:08:01 2019
Index rtime totalcount-val delta rate/sec symbol-name&device-no
1412 0 99121 2 0 vsvr_tot_Hits vserver_lb_9.9.9.9:443(MOBAPP_PX12_443_VIP)

reltime:mili second between two records Fri Apr 5 15:24:35 2019
Index rtime totalcount-val delta rate/sec symbol-name&device-no
2052 0 199757292 3265 466 vsvr_tot_Hits vserver_cs_0.0.0.0:443(COMOBWEB_BANK_HTTPS_CS_VIP)
2054 0 99268 1 0 vsvr_tot_Hits vserver_lb_9.9.9.9:443(MOBAPP_PX12_443_VIP)
2055 7000 199748605 3091 441 vsvr_tot_Hits vserver_lb_0.0.0.0:0(COMOBWEB_BANK_HTTPS_VIP)
2058 0 99270 2 0 vsvr_tot_Hits vserver_lb_9.9.9.9:443(MOBAPP_PX12_443_VIP)
2059 0 466040 6 0 vsvr_tot_Hits vserver_cs_8.8.8.8:80(COMOBWEB_BANK_HTTP_CS_VIP)
2072 7000 199761849 3271 467 vsvr_tot_Hits vserver_lb_0.0.0.0:0(COMOBWEB_BANK_HTTPS_VIP)
2076 0 99277 7 1 vsvr_tot_Hits vserver_lb_9.9.9.9:443(MOBAPP_PX12_443_VIP)
2077 7000 199765213 3364 480 vsvr_tot_Hits vserver_lb_0.0.0.0:0(COMOBWEB_BANK_HTTPS_VIP)
2078 0 99296 19 2 vsvr_tot_Hits vserver_lb_9.9.9.9:443(MOBAPP_PX12_443_VIP)
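To check how the rate/sec column relates to delta, the rows above can be parsed and compared. A sketch that assumes the whitespace-separated column layout shown in the sample; `HitRecord` and `parse_row` are hypothetical names.

```python
from typing import NamedTuple


class HitRecord(NamedTuple):
    index: int
    rtime_ms: int
    total: int
    delta: int
    rate: int
    counter: str
    device: str


def parse_row(line: str) -> HitRecord:
    """Split one data row of `nsconmsg ... -g vsvr_tot_Hits` output."""
    index, rtime, total, delta, rate, counter, device = line.split(maxsplit=6)
    return HitRecord(int(index), int(rtime), int(total), int(delta), int(rate), counter, device)


rows = [
    "2054 0 99268 1 0 vsvr_tot_Hits vserver_lb_9.9.9.9:443(MOBAPP_PX12_443_VIP)",
    "2058 0 99270 2 0 vsvr_tot_Hits vserver_lb_9.9.9.9:443(MOBAPP_PX12_443_VIP)",
    "2076 0 99277 7 1 vsvr_tot_Hits vserver_lb_9.9.9.9:443(MOBAPP_PX12_443_VIP)",
    "2078 0 99296 19 2 vsvr_tot_Hits vserver_lb_9.9.9.9:443(MOBAPP_PX12_443_VIP)",
    "2077 7000 199765213 3364 480 vsvr_tot_Hits vserver_lb_0.0.0.0:0(COMOBWEB_BANK_HTTPS_VIP)",
]

for rec in map(parse_row, rows):
    # rate/sec equals delta // 7 for every row here (integer truncation).
    assert rec.rate == rec.delta // 7, rec

# Consecutive MOBAPP_PX12 totals differ by exactly the next row's delta.
mob = [parse_row(r) for r in rows if "MOBAPP_PX12_443_VIP" in r]
for prev, cur in zip(mob, mob[1:]):
    assert cur.total - prev.total == cur.delta
```

For these rows the rate is truncated to an integer, which is why deltas below 7 show a rate of 0.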

To sum up, the problem does not seem to be related to slow start. Or am I missing something again? What can I do to make the least connection method work as expected? TCP multiplexing is not even in use, because the vServer is SSL_BRIDGE. I have no idea what else could be the reason.

Thanks,

Hakan

Mihai Cziraki1709160741 · April 9, 2019

You probably need to open a ticket with Citrix; they might be able to advise on this.

What is not clear to me is this:

"For a virtual server that is already configured and is serving the production traffic, when the services are enabled or the services are UP, the time to exit Slow Start is calculated using the following calculation:
Request rate = current instance value - previous instance value (before 7 seconds)"

I think it would be best to ask Citrix. Or maybe somebody else knows better and will reply on this thread.

Hakan Polatli1709158891 · April 10, 2019

Thanks for your comments Mihai.

You're right. I asked customer to open a ticket for this.

I assume "current instance value" corresponds to the value under the totalcount-val column of nslog, and accordingly "previous instance value" is the value 7 seconds earlier. But this calculation does not give you a rate; it gives you a number of hits, which you must divide by 7 to get the rate.

I think the correct notation should be like this:

Request rate: (current instance value-previous instance value) / 7

where "current instance value - previous instance value" corresponds to the delta column in nslog.
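The proposed notation, applied to two consecutive totalcount-val values of MOBAPP_PX12_443_VIP from the log sample (99277 and 99296). The 7-second interval is the one assumed in this thread, and the function name is hypothetical.

```python
SAMPLE_INTERVAL_S = 7  # 7-second sampling interval, as assumed in this thread


def request_rate(current_total: int, previous_total: int) -> float:
    """Request rate = (current instance value - previous instance value) / 7."""
    return (current_total - previous_total) / SAMPLE_INTERVAL_S


rate = request_rate(99296, 99277)  # delta = 19
print(f"{rate:.2f}", int(rate))    # 2.71 2  (nslog's rate/sec column shows 2)
```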

But the explanation from the command reference makes more sense. I think the current request rate corresponds to the value under the rate/sec column of nslog:

https://developer-docs.citrix.com/projects/netscaler-command-reference/en/12.0/lb/lb-parameter/lb-parameter/

For an existing virtual server, if one or more services are newly bound or newly enabled, or if the load balancing method is changed, the appliance dynamically computes the number of requests for which to implement startup round robin. It obtains this number by multiplying the request rate by the number of bound services (it includes services that are marked as DOWN). For example, if the current request rate is 20 requests/s and ten services are bound to the virtual server, the appliance performs startup round robin for 200 requests.

Hakan Polatli1709158891 · May 11, 2019

A late update: as soon as we changed the vServer's protocol from SSL_BRIDGE to SSL, the LB method in question started to work as expected.

FYI
