LB Configured Method and Current Method configuration problem

Started by Mohamed LAATABI, 09 February 2017 - 06:39 AM
8 replies to this topic

Mohamed LAATABI Members

Mohamed LAATABI
  • 7 posts

Posted 09 February 2017 - 06:39 AM

Hi guys, 
 
I have a NetScaler VPX that I use to load balance traffic across 6 McAfee proxy servers. When I first configured the VPX I used the round robin method, which worked well, but we wanted to go back to the default LEASTCONNECTION algorithm. I used the command below for each vserver:
 
for example:
 
set lb vserver VSERVER-X -lbMethod LeastConnection
 
When I run: show lb vserver VSERVER-X
 
I get:
 
 Configured Method: LEASTCONNECTION Current Method: Round Robin, Reason: LB method is changed
 
I have read some articles on slow start mode. In my case, I understand the exit threshold should be:
 
 100 x 6 x 1 = 600 hits  (100 = default startup RR factor, 6 = number of bound services, 1 = number of packet engines)
 
The problem is: I have over 4000 hits per second to this vserver and over 4000 TCP connections, but the load balancer does not leave slow start mode and I still get round robin :(
 
Can someone please help? I have looked everywhere, but I did not find a solution.
 
I'm using a NetScaler SDX 11515 appliance, with a VPX running "NetScaler NS10.5: Build 52.11.nc" for this particular farm of proxies.
 
I would appreciate your help :)

 



Paul Blitz Members

Paul Blitz
  • 4,036 posts

Posted 09 February 2017 - 01:38 PM

How many users? Are you using any form of persistence?

 

If so, then once a connection is LB'd, it stays on the same server, and I suspect that new hits from the same client might not then be counted as new connections from the LB perspective.... thus the 600 hits may not have been reached yet.
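
A quick way to sanity-check that would be to compare the size of the persistence table against the vserver's hit counters. A rough sketch, reusing the placeholder vserver name from your post (the exact persistentSessions syntax may vary slightly between builds):

> show persistentSessions
> stat lb vserver VSERVER-X

If the persistence table stays well populated while the hit counters climb, that would suggest most of the traffic is being steered by persistence rather than being freshly load balanced.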



Mohamed LAATABI Members

Mohamed LAATABI
  • 7 posts

Posted 09 February 2017 - 04:40 PM

Hi Paul,
 
I have about 40,000 users (if all users are browsing at the same time). These users browse through this VPX (and then through the proxies). I use source IP persistence, which I have set to 120 min.
 
I don't think it's related to persistence, because after the LB method change I made this morning I compared the persistentSessions table before and after, and I can see over 5,000 new sticky connections. (I waited about 2 hours for users to arrive at the offices.)
The Citrix documentation also talks about the number of hits on the vserver (not the number of new connections), which is why I'm confused by this behaviour.
 
The NetScalers are configured in HA mode, two-arm configuration. I do some rewriting and header injection, but nothing related to load balancing.

 

Mohamed



Paul Blitz Members

Paul Blitz
  • 4,036 posts

Posted 10 February 2017 - 09:47 AM

I guess we now need to find out how the LBs are set up... can you post the relevant parts of your ns.conf that show the settings for the LB vserver and its services / service group?



Raman Kaushik Citrix Employees

Raman Kaushik
  • 6 posts

Posted 10 February 2017 - 12:54 PM

The appliance can alternatively be configured to require that a specific number of requests pass through the virtual server before it exits Slow Start mode. Run the following command to set this using the startup RR factor:

> set lb parameter -startupRRFactor 5

If the appliance has seven packet engines with 6 services bound to the virtual server and the startup RR factor is 5, the virtual server exits Slow Start mode when it reaches the following:
5 hits x bound services (6) x number of packet engines (7) = 210 hits (max)

By default, a newly configured virtual server remains in Slow Start mode with a startup RR factor of 100.

NOTE: Make sure that none of the services are flapping from UP to DOWN; I would expect that to reset the slow start counter.
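
To verify what the appliance is currently working with, something like the following should show the global startup RR factor and whether the virtual server has actually left Slow Start (the vserver name is the placeholder from the first post, and field names may differ slightly between builds):

> show lb parameter
> show lb vserver VSERVER-X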



Mohamed LAATABI Members

Mohamed LAATABI
  • 7 posts

Posted 13 February 2017 - 05:42 PM

Hi,

 

I have over 4000 hits/sec on the vserver, so with the default startupRRFactor of 100 (100 x 6 services x 1 packet engine = 600 hits), the NetScaler should exit slow start mode in less than 1 second.

 

I have to understand why it's not working before scheduling a change :) 

 

Sorry Paul, I'm out of the office with no access to the equipment. I will share parts of the ns.conf when I get back.

 

Thanks



Mohamed LAATABI Members

Mohamed LAATABI
  • 7 posts

Posted 10 October 2017 - 01:12 PM

Hi,

 

Here is the relevant info from the configuration of one vserver that shows the problem. I haven't made any modification to the slow start parameter.

 

XXX-XXX-Kerberos (X.X.X.X:8080) - HTTP   Type: ADDRESS
        State: UP
        Last state change was at Mon Oct  9 22:31:32 2017
        Time since last state change: 0 days, 14:19:38.360
        Effective State: UP
        Client Idle Timeout: 180 sec
        Down state flush: ENABLED
        Disable Primary Vserver On Down : DISABLED
        Appflow logging: ENABLED
        Port Rewrite : DISABLED
        No. of Bound Services :  8 (Total)       6 (Active)
        Configured Method: LEASTCONNECTION
        Current Method: Round Robin, Reason: Bound service's state changed to UP
        Mode: IP
        Persistence: SOURCEIP   Persistence Mask: 255.255.255.255       Persistence Timeout: 120 min
        Vserver IP and Port insertion: OFF
        Push: DISABLED  Push VServer:
        Push Multi Clients: NO
        Push Label Rule: none
        L2Conn: OFF
        Skip Persistency: None
        IcmpResponse: PASSIVE
        RHIstate: PASSIVE
        New Service Startup Request Rate: 0 PER_SECOND, Increment Interval: 0
        Mac mode Retain Vlan: DISABLED
        DBS_LB: DISABLED
        Process Local: DISABLED
        Traffic Domain: 879

 

 

SERVICEGROUP - HTTP
        State: ENABLED  Effective State: PARTIAL-UP     Monitor Threshold : 0
        Max Conn: 0     Max Req: 0      Max Bandwidth: 0 kbits
        Use Source IP: YES              Use Proxy Port: YES
        Client Keepalive(CKA): NO
        TCP Buffering(TCPB): NO
        HTTP Compression(CMP): NO
        Idle timeout: Client: 180 sec   Server: 360 sec
        Client IP: DISABLED
        Cacheable: NO
        SC: OFF
        SP: OFF
        Down state flush: ENABLED
        Appflow logging: ENABLED
        Process Local: DISABLED
        Traffic Domain: 879

 

 

add lb vserver OUR_VSERVER HTTP X.X.X.X 8080 -persistenceType SOURCEIP -timeout 120 -cltTimeout 180 -td 879
bind lb vserver OUR_VSERVER OUR_SERVICE_G
bind lb vserver OUR_VSERVER -policyName Policy-Inject-VIP-IP -priority 100 -gotoPriorityExpression NEXT -type REQUEST
bind lb vserver OUR_VSERVER -policyName Policy-Inject-VIP-PORT -priority 110 -gotoPriorityExpression END -type REQUEST
add serviceGroup OUR_SERVICE_G HTTP -td 879 -maxClient 0 -maxReq 0 -cip DISABLED -usip YES -useproxyport YES -cltTimeout 180 -svrTimeout 360 -CKA NO -TCPB NO -CMP NO
bind serviceGroup OUR_SERVICE_G -monitorName Monitor-Probe-HTTP-CUSTOM

 

Thanks for your help.



Paul Blitz Members

Paul Blitz
  • 4,036 posts

Posted 11 October 2017 - 12:58 PM

Working on the premise that "persistence overrides the LB method": whilst you may have thousands of hits, if they are still within persistence then they are probably NOT deemed to be load balanced.... so you need to wait for a load of NEW connections that do not have any persistence.

 

Of course, if you wait overnight (which will have happened a few times by now!) the persistent sessions will have cleared, and you'll get a load of new sessions to actually LB.

 

I note that the service group is showing as "PARTIAL-UP".... are the member services 100% stable, or do they go up & down? Remember that each time a service goes down and then up again, you'll get slow start kicking in.
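
A rough way to check that, reusing the service group name from your config paste: look at the state of each bound member and (on most builds) its last state change time, and keep an eye on the persistence table size:

> show serviceGroup OUR_SERVICE_G
> show persistentSessions

If any member shows a recent state change, slow start will have been restarted at that point.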

 

I just found http://ronnyholtmaat.nl/citrix-netscaler-load-balancing-squid-which-acts-as-an-internet-forwarding-proxy/ ,  which is about LBing a Squid proxy.... looks similar to what you have, but might be worth a quick look.



Mohamed LAATABI Members

Mohamed LAATABI
  • 7 posts

Posted 12 October 2017 - 03:49 PM

Hi Paul,
 
That explains everything:
 
The persistence timeout I set is 2 hours, so that users stick to the same real server all day (the real servers are McAfee proxies with different public IP addresses; we did that because some websites block users who come from changing IP addresses, so we wanted each user to keep the same public IP address all day long). The problem with this population of users is that they leave their PCs and browsers online, and our persistence parameter is source IP (XFF for other populations). With TCP keepalives the timer will always be refreshed and the persistence table always full (I will check that tonight). So as you said, even if the population of users is big, maybe we don't get new users in the morning, and that's why the algorithm doesn't change to least connection.
 
I think COOKIEINSERT would be great for our scenario :) or a script that flushes the persistence table every morning. Or tuning the startupRRFactor.
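
For reference, a rough sketch of the two configuration changes being considered, reusing the OUR_VSERVER placeholder from the config paste above (double-check the exact syntax on your build before scheduling anything):

> set lb vserver OUR_VSERVER -persistenceType COOKIEINSERT -timeout 120
> set lb parameter -startupRRFactor 5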
 
The PARTIAL-UP state is normal, because we have 2 proxies out of production for now.
 
Thank you very much for the help.