GEODE-10056: Improve gateway-receiver load balance
The problem is that servers send an incorrect gateway-receiver connection load to locators within CacheServerLoadMessage. Additionally, locators do not refresh the gateway-receiver load with the load received in CacheServerLoadMessage. The only time a locator increments the gateway-receiver load is after it receives ClientConnectionRequest{group=__recv__group...} and returns the selected server in a ClientConnectionResponse message. The client sends a ClientConnectionRequest to one locator from the list received in RemoteLocatorJoinResponse (the initial list of locators) or LocatorListRequest (the periodically updated list of locators). The received list is always sorted by host address and port. The client sends the ClientConnectionRequest following the sorted list of locators (from first to last) until a successful outcome. That means that under normal conditions the same locator (the first one in the list) handles all connection requests, and the other locators never update their gateway-receiver connection load.
The solution is to track the gateway-receiver acceptor connection count correctly and, based on it, accurately calculate the load when sending CacheServerLoadMessage. Additionally, each locator will read the load received in CacheServerLoadMessage and update the gateway-receiver connection load in the __recv__group group accordingly.
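To make the intended flow concrete, here is a minimal, hedged sketch. The real logic lives in LoadMonitor, CacheServerLoadMessage and LocatorLoadSnapshot; the class and field names below are hypothetical, and the 800-connection default used to derive loadPerConnection = 0.00125 is an assumption that happens to match the values in the logs later in this thread:

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch only, not the actual Geode implementation.
public class ReceiverLoadSketch {

  // Locator-side view: group name -> (server "host:port" -> connection load).
  private final Map<String, Map<String, Float>> connectionLoadMap = new HashMap<>();

  // Server side: derive the receiver load from the acceptor's live connection count.
  // Assuming the default of 800 max connections, loadPerConnection is 1/800 = 0.00125.
  static float computeReceiverLoad(int acceptorConnectionCount) {
    float loadPerConnection = 1.0f / 800;
    return acceptorConnectionCount * loadPerConnection;
  }

  // Locator side: when a CacheServerLoadMessage arrives, overwrite the locally
  // estimated load for the gateway-receiver group ("__recv__group") with the server's value.
  void applyReceivedLoad(String serverLocation, float receivedLoad) {
    connectionLoadMap
        .computeIfAbsent("__recv__group", group -> new HashMap<>())
        .put(serverLocation, receivedLoad);
  }
}
```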
For all changes:
- [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
- [x] Has your PR been rebased against the latest commit within the target branch (typically `develop`)?
- [x] Is your initial contribution a single, squashed commit?
- [x] Does `gradlew build` run cleanly?
- [x] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
Hi reviewers,
With this solution, each server will now send a CacheServerLoadMessage containing the correct connection load of the gateway-receiver to all locators in the cluster. This happens every 5 seconds, as configured by the load-poll-interval parameter. Additionally, the coordinator locator will increase the load each time it provides a server location to a remote gateway-sender via ClientConnectionRequest/ClientConnectionResponse. The locator only maintains this load temporarily, until the next CacheServerLoadMessage is received. This behavior makes sense because the server tracks the connection load more accurately than the locator: the locator only increases the connection load based on the connection requests it receives, while the server adjusts the connection load each time a connection is established or closed.
ClientConnectionRequest messages are usually sent to the locator in bursts when the gateway-sender is establishing connections due to traffic. During such a burst the locator's connection load runs well ahead of the server's connection load, because the server has not yet established those connections. Suppose that during a burst a CacheServerLoadMessage arrives at the locator carrying a low load value for one of the gateway-receivers. That receiver will then be picked more frequently (it has the lowest load), resulting in unbalanced gateway-sender connections. For this to have a big impact on the load balancing of sender connections, the gateway-receivers must be started with a small delay between them, so that their CacheServerLoadMessages are sent far enough apart to cause the imbalance. If the CacheServerLoadMessages were sent at roughly the same time, this would not be a problem, as all messages would carry similar loads and would update the locator at a similar time.
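To make the race concrete, here is a hedged sketch of the two updates that collide (hypothetical names; the real locator-side classes are LocatorLoadSnapshot and LoadHolder):

```java
// Hedged illustration of the race described above; not the actual Geode code.
public class LoadRaceSketch {
  private float load; // locator's current estimate for one gateway-receiver
  private static final float LOAD_PER_CONNECTION = 0.00125f;

  // Called for each ClientConnectionRequest the locator answers: the estimate runs
  // ahead of the server, which has not yet accepted those connections.
  synchronized void onConnectionGranted() {
    load += LOAD_PER_CONNECTION;
  }

  // Called when a CacheServerLoadMessage arrives: the server's (older) value simply
  // replaces the estimate, so a burst of grants can be wiped back to a low value.
  synchronized void onServerLoadMessage(float reportedLoad) {
    load = reportedLoad;
  }
}
```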
I would be really grateful if you could share your opinion on this matter.
I'm not sure how to resolve the race condition you mention, but I see similar behavior with client/server connections.
If a burst of connections is requested and none of those are made before the next load is received from the server, then the locator's load for that server gets reset back to zero.
A burst of connections (10 in this case) causes the load to go from 0.0 to 0.012499998:
[warn 2022/03/15 14:38:37.905 PDT locator <locator request thread 1> tid=0x24] XXX LocatorLoadSnapshot.getServerForConnection potentialServers={192.168.1.5:[email protected](server1:30200)<v1>:41001=LoadHolder[0.0, 192.168.1.5:51249, loadPollInterval=5000, 0.00125]}
[warn 2022/03/15 14:38:37.906 PDT locator <locator request thread 1> tid=0x24] XXX LocatorLoadSnapshot.getServerForConnection selectedServer=192.168.1.5:51249; loadBeforeUpdate=0.0
[warn 2022/03/15 14:38:37.907 PDT locator <locator request thread 1> tid=0x24] XXX LoadHolder.incConnections location=192.168.1.5:51249; load=0.00125
[warn 2022/03/15 14:38:37.907 PDT locator <locator request thread 1> tid=0x24] XXX LocatorLoadSnapshot.getServerForConnection selectedServer=192.168.1.5:51249; loadAfterUpdate=0.00125
...
[warn 2022/03/15 14:38:38.005 PDT locator <locator request thread 1> tid=0x24] XXX LocatorLoadSnapshot.getServerForConnection potentialServers={192.168.1.5:[email protected](server1:30200)<v1>:41001=LoadHolder[0.011249999, 192.168.1.5:51249, loadPollInterval=5000, 0.00125]}
[warn 2022/03/15 14:38:38.005 PDT locator <locator request thread 1> tid=0x24] XXX LocatorLoadSnapshot.getServerForConnection selectedServer=192.168.1.5:51249; loadBeforeUpdate=0.011249999
[warn 2022/03/15 14:38:38.005 PDT locator <locator request thread 1> tid=0x24] XXX LoadHolder.incConnections location=192.168.1.5:51249; load=0.012499998
[warn 2022/03/15 14:38:38.005 PDT locator <locator request thread 1> tid=0x24] XXX LocatorLoadSnapshot.getServerForConnection selectedServer=192.168.1.5:51249; loadAfterUpdate=0.012499998
If none of those connections are made before the next load is sent by that server, its load goes from 0.012499998 to 0.0:
[warn 2022/03/15 14:39:25.140 PDT locator <P2P message reader for 192.168.1.5(server1:30200)<v1>:41001 unshared ordered sender uid=5 dom #1 local port=55139 remote port=51286> tid=0x56] XXX LocatorLoadSnapshot.updateLoad about to update connectionLoadMap location=192.168.1.5:51249; load=0.0; loadPerConnection=0.00125
[warn 2022/03/15 14:39:25.140 PDT locator <P2P message reader for 192.168.1.5(server1:30200)<v1>:41001 unshared ordered sender uid=5 dom #1 local port=55139 remote port=51286> tid=0x56] XXX LocatorLoadSnapshot.updateMap location=192.168.1.5:51249; loadBeforeUpdate=0.012499998
[warn 2022/03/15 14:39:25.141 PDT locator <P2P message reader for 192.168.1.5(server1:30200)<v1>:41001 unshared ordered sender uid=5 dom #1 local port=55139 remote port=51286> tid=0x56] XXX LocatorLoadSnapshot.updateMap location=192.168.1.5:51249; loadAfterUpdate=0.0
[warn 2022/03/15 14:39:25.141 PDT locator <P2P message reader for 192.168.1.5(server1:30200)<v1>:41001 unshared ordered sender uid=5 dom #1 local port=55139 remote port=51286> tid=0x56] XXX LocatorLoadSnapshot.updateLoad done update connectionLoadMap location=192.168.1.5:51249
The load for the next request starts at 0.0 again:
[warn 2022/03/15 14:39:33.475 PDT locator <locator request thread 2> tid=0x54] XXX LocatorLoadSnapshot.getServerForConnection potentialServers={192.168.1.5:[email protected](server1:30200)<v1>:41001=LoadHolder[0.0, 192.168.1.5:51249, loadPollInterval=5000, 0.00125]}
[warn 2022/03/15 14:39:33.475 PDT locator <locator request thread 2> tid=0x54] XXX LocatorLoadSnapshot.getServerForConnection selectedServer=192.168.1.5:51249; loadBeforeUpdate=0.0
[warn 2022/03/15 14:39:33.475 PDT locator <locator request thread 2> tid=0x54] XXX LoadHolder.incConnections location=192.168.1.5:51249; load=0.00125
[warn 2022/03/15 14:39:33.475 PDT locator <locator request thread 2> tid=0x54] XXX LocatorLoadSnapshot.getServerForConnection selectedServer=192.168.1.5:51249; loadAfterUpdate=0.00125
...
One thing to note is that the load is only sent every load-poll-interval (default = 5 seconds) if it has changed. If it hasn't changed, then it only gets sent every update frequency (which is 10 * 5 seconds by default).
There is a system property to control that frequency too:
private static final int FORCE_LOAD_UPDATE_FREQUENCY = getInteger(
GeodeGlossary.GEMFIRE_PREFIX + "BridgeServer.FORCE_LOAD_UPDATE_FREQUENCY", 10);
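For reference, assuming GeodeGlossary.GEMFIRE_PREFIX resolves to "gemfire.", the full property name would be gemfire.BridgeServer.FORCE_LOAD_UPDATE_FREQUENCY, and it can be overridden on the server JVM before the cache server starts, e.g.:

```java
public class ForceLoadUpdateFrequencyExample {
  public static void main(String[] args) {
    // Assumption: GeodeGlossary.GEMFIRE_PREFIX is "gemfire.". Setting the value to 2 would
    // force an unchanged load to be re-sent every 2 poll intervals instead of the default 10.
    // The equivalent JVM flag would be -Dgemfire.BridgeServer.FORCE_LOAD_UPDATE_FREQUENCY=2.
    System.setProperty("gemfire.BridgeServer.FORCE_LOAD_UPDATE_FREQUENCY", "2");
  }
}
```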
The load-poll-interval is configurable, but currently only for the cache server, not the gateway receiver. It probably wouldn't be too hard to add this support to the gateway receiver.
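For the cache server, the interval can be set programmatically; a sketch using CacheServer.setLoadPollInterval (as far as I know there is no equivalent on the GatewayReceiver API yet):

```java
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.server.CacheServer;

public class LoadPollIntervalExample {
  public static void main(String[] args) throws Exception {
    Cache cache = new CacheFactory().create();
    CacheServer server = cache.addCacheServer();
    // Report the connection load to the locators every 2 seconds instead of the default 5000 ms.
    server.setLoadPollInterval(2000);
    server.start();
  }
}
```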
Also, there is a gfsh load-balance gateway-sender command that could help alleviate this condition.
I'm still reviewing the PR.
I ran a few tests with some extra logging on these changes. They look good.
The receiver exchanges profiles with the locator:
[warn 2022/03/16 14:16:12.440 PDT locator-ln <Pooled High Priority Message Processor 2> tid=0x50] XXX LocatorLoadSnapshot.updateConnectionLoadMap location=192.168.1.5:5370; load=0.0
[warn 2022/03/16 14:16:12.441 PDT locator-ln <Pooled High Priority Message Processor 2> tid=0x50] XXX LocatorLoadSnapshot.updateConnectionLoadMap current load for location=192.168.1.5:5370; group=__recv__group; inputLoad=0.0; currentLoad=0.0
[warn 2022/03/16 14:16:12.441 PDT locator-ln <Pooled High Priority Message Processor 2> tid=0x50] XXX LocatorLoadSnapshot.updateConnectionLoadMap updated load for location=192.168.1.5:5370; group=__recv__group; inputLoad=0.0; newLoad=0.0
The connectionLoadMap shows 2 groups, namely the null group (default) and the __recv__group group (gateway receiver), each with load=0.0:
[warn 2022/03/16 14:16:13.777 PDT locator-ln <Thread-14> tid=0x43] XXX LocatorLoadSnapshot.logConnectionLoadMap
The connectionLoadMap contains the following 2 entries:
group=null
location=192.168.1.5:56224; load=0.0
group=__recv__group
location=192.168.1.5:5370; load=0.0
Sender connects to the receiver:
With the default of 5 dispatcher threads, 5 connections are made to the receiver. The load goes from 0.0 to 0.0062499996:
[warn 2022/03/16 14:16:53.836 PDT locator-ln <locator request thread 2> tid=0x47] XXX LocatorLoadSnapshot.getServerForConnection group=__recv__group; server=192.168.1.5:5370; loadBeforeUpdate=0.0
[warn 2022/03/16 14:16:53.836 PDT locator-ln <locator request thread 2> tid=0x47] XXX LocatorLoadSnapshot.getServerForConnection group=__recv__group; server=192.168.1.5:5370; loadAfterUpdate=0.00125
[warn 2022/03/16 14:16:53.836 PDT locator-ln <locator request thread 6> tid=0x5c] XXX LocatorLoadSnapshot.getServerForConnection group=__recv__group; server=192.168.1.5:5370; loadBeforeUpdate=0.00125
[warn 2022/03/16 14:16:53.836 PDT locator-ln <locator request thread 6> tid=0x5c] XXX LocatorLoadSnapshot.getServerForConnection group=__recv__group; server=192.168.1.5:5370; loadAfterUpdate=0.0025
[warn 2022/03/16 14:16:53.837 PDT locator-ln <locator request thread 5> tid=0x5b] XXX LocatorLoadSnapshot.getServerForConnection group=__recv__group; server=192.168.1.5:5370; loadBeforeUpdate=0.0025
[warn 2022/03/16 14:16:53.837 PDT locator-ln <locator request thread 5> tid=0x5b] XXX LocatorLoadSnapshot.getServerForConnection group=__recv__group; server=192.168.1.5:5370; loadAfterUpdate=0.00375
[warn 2022/03/16 14:16:53.837 PDT locator-ln <locator request thread 4> tid=0x5a] XXX LocatorLoadSnapshot.getServerForConnection group=__recv__group; server=192.168.1.5:5370; loadBeforeUpdate=0.00375
[warn 2022/03/16 14:16:53.837 PDT locator-ln <locator request thread 4> tid=0x5a] XXX LocatorLoadSnapshot.getServerForConnection group=__recv__group; server=192.168.1.5:5370; loadAfterUpdate=0.005
[warn 2022/03/16 14:16:53.838 PDT locator-ln <locator request thread 3> tid=0x59] XXX LocatorLoadSnapshot.getServerForConnection group=__recv__group; server=192.168.1.5:5370; loadBeforeUpdate=0.005
[warn 2022/03/16 14:16:53.838 PDT locator-ln <locator request thread 3> tid=0x59] XXX LocatorLoadSnapshot.getServerForConnection group=__recv__group; server=192.168.1.5:5370; loadAfterUpdate=0.0062499996
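The numbers line up with a per-connection load of 0.00125 (apparently 1/800): five dispatcher connections give 5 * 0.00125 = 0.00625, and the logged 0.0062499996 looks like single-precision rounding of the locator's repeated float additions (the server itself reports the exact 0.00625 in a later excerpt). A minimal illustration, assuming the locator accumulates a float per granted connection:

```java
public class LoadArithmeticSketch {
  public static void main(String[] args) {
    // Adding 0.00125f once per granted connection, as the locator appears to do,
    // explains why the log shows 0.0062499996 rather than exactly 5 * 0.00125 = 0.00625.
    float loadPerConnection = 0.00125f;
    float load = 0.0f;
    for (int i = 0; i < 5; i++) {
      load += loadPerConnection;
    }
    System.out.println(load); // prints a value within one float ulp of 0.00625
  }
}
```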
The connectionLoadMap shows the same 2 groups but now the __recv__group group load is 0.0062499996 for the gateway receiver:
[warn 2022/03/16 14:16:55.831 PDT locator-ln <Thread-14> tid=0x43] XXX LocatorLoadSnapshot.logConnectionLoadMap
The connectionLoadMap contains the following 2 entries:
group=null
location=192.168.1.5:56224; load=0.0
group=__recv__group
location=192.168.1.5:5370; load=0.0062499996
Update the load:
Periodically, the server sends an updated load to the locator.
[warn 2022/03/16 14:16:57.464 PDT locator-ln <P2P message reader for 192.168.1.5(ln-1:75228)<v1>:41002 unshared ordered sender uid=5 dom #1 local port=45635 remote port=56270> tid=0x5e] XXX LocatorLoadSnapshot.updateConnectionLoadMap current load for location=192.168.1.5:5370; group=__recv__group; inputLoad=0.00625; currentLoad=0.0062499996
[warn 2022/03/16 14:16:57.464 PDT locator-ln <P2P message reader for 192.168.1.5(ln-1:75228)<v1>:41002 unshared ordered sender uid=5 dom #1 local port=45635 remote port=56270> tid=0x5e] XXX LocatorLoadSnapshot.updateConnectionLoadMap updated load for location=192.168.1.5:5370; group=__recv__group; inputLoad=0.00625; newLoad=0.00625
[warn 2022/03/16 14:16:57.832 PDT locator-ln <Thread-14> tid=0x43] XXX LocatorLoadSnapshot.logConnectionLoadMap
The connectionLoadMap contains the following 2 entries:
group=null
location=192.168.1.5:56224; load=0.0
group=__recv__group
location=192.168.1.5:5370; load=0.00625
Update the load after a ping connection has been made:
After another connection is made, the load is updated again.
[warn 2022/03/16 14:17:02.466 PDT locator-ln <P2P message reader for 192.168.1.5(ln-1:75228)<v1>:41002 unshared ordered sender uid=5 dom #1 local port=45635 remote port=56270> tid=0x5e] XXX LocatorLoadSnapshot.updateConnectionLoadMap current load for location=192.168.1.5:5370; group=__recv__group; inputLoad=0.0075; currentLoad=0.00625
[warn 2022/03/16 14:17:02.466 PDT locator-ln <P2P message reader for 192.168.1.5(ln-1:75228)<v1>:41002 unshared ordered sender uid=5 dom #1 local port=45635 remote port=56270> tid=0x5e] XXX LocatorLoadSnapshot.updateConnectionLoadMap updated load for location=192.168.1.5:5370; group=__recv__group; inputLoad=0.0075; newLoad=0.0075
[warn 2022/03/16 14:17:03.841 PDT locator-ln <Thread-14> tid=0x43] XXX LocatorLoadSnapshot.logConnectionLoadMap
The connectionLoadMap contains the following 2 entries:
group=null
location=192.168.1.5:56224; load=0.0
group=__recv__group
location=192.168.1.5:5370; load=0.0075
Connect another sender:
Another sender with 5 dispatcher threads connects, and the load is updated again.
[warn 2022/03/16 14:29:44.794 PDT locator-ln <Thread-14> tid=0x43] XXX LocatorLoadSnapshot.logConnectionLoadMap
The connectionLoadMap contains the following 2 entries:
group=null
location=192.168.1.5:56600; load=0.0
group=__recv__group
location=192.168.1.5:5190; load=0.015
Disconnect one sender:
When a sender disconnects, the load is updated again.
[warn 2022/03/16 14:30:38.843 PDT locator-ln <Thread-14> tid=0x43] XXX LocatorLoadSnapshot.logConnectionLoadMap
The connectionLoadMap contains the following 2 entries:
group=null
location=192.168.1.5:56600; load=0.0
group=__recv__group
location=192.168.1.5:5190; load=0.0075
Start another receiver:
When another receiver is started, an entry for it is added to the connectionLoadMap with load=0.0.
[warn 2022/03/16 14:35:07.535 PDT locator-ln <Thread-14> tid=0x43] XXX LocatorLoadSnapshot.logConnectionLoadMap
The connectionLoadMap contains the following 2 entries:
group=null
location=192.168.1.5:56940; load=0.0
location=192.168.1.5:56833; load=0.0
group=__recv__group
location=192.168.1.5:5055; load=0.015
location=192.168.1.5:5256; load=0.0
Two receivers and two senders:
When two receivers are started and two senders are connected, the load is updated (and balanced). In this case, the extra connections are pingers - one from each sender to each receiver.
[warn 2022/03/16 14:44:32.269 PDT locator-ln <Thread-14> tid=0x43] XXX LocatorLoadSnapshot.logConnectionLoadMap
The connectionLoadMap contains the following 2 entries:
group=null
location=192.168.1.5:57530; load=0.0
location=192.168.1.5:57553; load=0.0
group=__recv__group
location=192.168.1.5:5349; load=0.00875
location=192.168.1.5:5025; load=0.00875
Load balance senders:
This feature does not seem to be working properly. These changes seem to make it work better. I have another bunch of analysis on this that I will either post separately or file a JIRA on.
Hi @boglesby,
I also assumed that the same race condition is possible for client connections, but I haven't tried to reproduce it. Thanks for pointing this out and for all the other valuable information. Also, thank you for the extensive testing you have done.
If we decide to go with this solution, I agree that we should make the load-poll-interval parameter configurable for gateway receivers. Changing it to a lower value would slightly mitigate the effects of the race condition.
The load-balance gateway-sender command works on the server as follows:
- it pauses the gateway-sender
- it destroys all connections and then relies upon the mechanism used during connection creation (ClientConnectionRequest/Response) to do better load balancing
- it resumes the gateway-sender
This command will again result in a burst of connection requests that could hit the issue caused by the race condition.
Maybe, instead of sending load information periodically from the servers, the locator could scrape it (perhaps using CacheServerMXBean) from the servers and apply it simultaneously for all receivers in the locator. The locator could fetch the load when it receives a connection request and the current connection load is stale (e.g., older than 200 ms), as we don't expect many connections from gateway-senders. This way, the locator would at least have an up-to-date connection load taken at a similar time on all servers. This approach should even catch the change in connection load when the load-balance command destroys all connections.
Maybe an algorithm like this could work (a rough sketch follows the list):
- When a connection request is received, check whether the connection load is stale (older than a new parameter, load-update-frequency=200ms):
  - if yes, try to get the connection load from all servers asynchronously:
    - if the load is received from all servers, apply it in the locator
    - if any get fails, check the profiles again and immediately retry for all servers
  - if no, use the current load immediately
- If no connection request is received, just fetch the load periodically, e.g., every 5 seconds (load-poll-interval)
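A rough, hypothetical sketch of that flow; fetchLoadFromServer() stands in for whatever mechanism (e.g., a CacheServerMXBean call) the locator would use to read the load directly from each server:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Prototype sketch only; all names are hypothetical.
public class StaleLoadRefreshSketch {
  private static final long LOAD_UPDATE_FREQUENCY_MS = 200;
  private volatile long lastRefreshMillis;

  void onConnectionRequest(List<String> receiverLocations, Map<String, Float> connectionLoadMap) {
    long now = System.currentTimeMillis();
    if (now - lastRefreshMillis > LOAD_UPDATE_FREQUENCY_MS) {
      // Ask every receiver for its current load in parallel ...
      List<CompletableFuture<Float>> futures = receiverLocations.stream()
          .map(location -> CompletableFuture.supplyAsync(() -> fetchLoadFromServer(location)))
          .collect(Collectors.toList());
      try {
        // ... and only apply the results if all servers answered, so that the values
        // in the map were all taken at roughly the same time.
        List<Float> loads = new ArrayList<>();
        for (CompletableFuture<Float> future : futures) {
          loads.add(future.join());
        }
        for (int i = 0; i < receiverLocations.size(); i++) {
          connectionLoadMap.put(receiverLocations.get(i), loads.get(i));
        }
        lastRefreshMillis = now;
      } catch (RuntimeException e) {
        // Any failure: keep the old values; the real thing would re-check the profiles and retry.
      }
    }
    // Otherwise (or after the refresh) the locator uses the current load as before.
  }

  private float fetchLoadFromServer(String location) {
    return 0.0f; // placeholder for the real remote call
  }
}
```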
I'm not sure if this makes any sense, as I don't know how fast the locator can scrape the load. I can create a prototype if you think this could work.
That's a pretty cool idea. I'm not sure whether the CacheServerMXBean has that behavior, but I guess it could be added. In any event, I think this change is good. I'm approving it, but you need to address the ParallelGatewaySenderConnectionLoadBalanceDistributedTest failure.
Hi @Bill, @echobravopapa, @kamilla1201 and @pivotal-jbarrett, this PR requires reviews from your side before it can be merged. Could you please review it?
I think I would like @upthewaterspout to take a look at this too for good measure.
This PR has been hanging for a long time now, and we should decide whether to close it or merge it.
I think this PR adds value to Apache Geode if we at least "synchronize" the sending of CacheServerLoadMessage on all servers. The current possible difference of up to 5 seconds is just too much. I think this could be done with the following simple (not so smart) algorithm:
@@ -159,13 +162,33 @@ public class LoadMonitor implements ConnectionListener {
}
}
+ /**
+ * This function calculates next interval absolute time that is same on all servers in
+ * the cluster if following conditions are fulfilled:
+ * - same pollInterval value is used
+ * - time is synchronized on servers
+ *
+ * @return absolute time of next interval
+ */
+ private long getNextIntervalSynchronizedAbsoluteTime(final long currentTime,
+ final long pollInterval) {
+ return (currentTime - (currentTime % pollInterval)) + pollInterval;
+ }
+
@Override
public void run() {
while (alive) {
try {
synchronized (signal) {
- long end = System.currentTimeMillis() + pollInterval;
- long remaining = pollInterval;
+ long currentTime = System.currentTimeMillis();
+ long end, remaining;
+ if (isGatewayReceiver) {
+ end = getNextIntervalSynchronizedAbsoluteTime(currentTime, pollInterval);
+ remaining = end - currentTime;
+ } else {
+ end = currentTime + pollInterval;
+ remaining = pollInterval;
+ }
while (alive && remaining > 0) {
signal.wait(remaining);
remaining = end - System.currentTimeMillis();
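For reference, a small standalone illustration of the alignment arithmetic above: every server rounds the current time down to the same pollInterval boundary and wakes at the next one, so (with synchronized clocks) their CacheServerLoadMessages go out at roughly the same instant.

```java
public class SynchronizedIntervalSketch {
  // Same arithmetic as getNextIntervalSynchronizedAbsoluteTime() in the diff above.
  static long nextInterval(long currentTime, long pollInterval) {
    return (currentTime - (currentTime % pollInterval)) + pollInterval;
  }

  public static void main(String[] args) {
    long pollInterval = 5000;
    // Two servers sampling the clock at slightly different moments still agree on the deadline:
    System.out.println(nextInterval(1_647_473_321_234L, pollInterval)); // 1647473325000
    System.out.println(nextInterval(1_647_473_323_999L, pollInterval)); // 1647473325000
  }
}
```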
@boglesby what do you think about this?