
Abnormal Mget latency increase issue

Open ackerL opened this issue 1 year ago • 2 comments

Bug Report

We run a service A fleet of 500+ hosts, all of which use the Lettuce client to access a Redis cluster (around 20 shards, 100 hosts in total). Recently we observed anomalies caused by deployments of the service fleet (rolling deployment, ~20 hosts per round, each round taking ~10 minutes). During the deployment, the MGET latency (measured from service A's side) increased noticeably, from 15ms to 20+ms.

Figure 1: Service A uses Lettuce to access the Redis cluster; the MGET latency increases during the fleet deployment

Figure 2: MGET latency increases from 15ms to 20ms during the fleet deployment

After checking the service A logs, especially the Lettuce logs, we do not observe any anomalies. We currently cannot explain why the service A fleet deployment triggers the MGET latency increase; the deployment is the only variable that changes.

Are there any clues that could help with the next steps of troubleshooting this abnormal latency increase? Thanks.

Current Behavior

Stack trace

Input Code


Expected behavior/code

Environment

  • Lettuce version(s): 5.3.X.
  • Redis version: 5.X

Possible Solution

Additional context

ackerL · Oct 29 '24 15:10

The team will attempt to dig into this issue some more, but from the quick read I did it would be extremely hard, close to impossible, to answer the question without a lot more information being provided.

A latency spike of 5ms is an extremely low threshold and could be caused by virtually any of the actors in the chain.

Unless you detect some difference in the way the driver behaves (by profiling it while this issue occurs and monitoring the traffic) we could only play a guessing game, which is not helpful for anyone.
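
For reference, Lettuce ships a command latency collector that periodically publishes CommandLatencyEvent instances on the client's EventBus, which is one way to monitor the driver side. The snippet below is only a minimal sketch: it assumes HdrHistogram/LatencyUtils are on the classpath (the default collector needs them), and the one-minute emit interval is just an example value.

import java.time.Duration;

import io.lettuce.core.event.DefaultEventPublisherOptions;
import io.lettuce.core.event.metrics.CommandLatencyEvent;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;

public class LatencyMetricsSketch {

    // Build ClientResources that emit aggregated command latency events every minute.
    public static ClientResources latencyAwareResources() {
        return DefaultClientResources.builder()
                .commandLatencyPublisherOptions(DefaultEventPublisherOptions.builder()
                        .eventEmitInterval(Duration.ofMinutes(1))
                        .build())
                .build();
    }

    // Subscribe to the EventBus and print per-command metrics, which helps separate
    // driver-side latency from server/network-side latency.
    public static void subscribe(ClientResources resources) {
        resources.eventBus().get()
                .filter(CommandLatencyEvent.class::isInstance)
                .cast(CommandLatencyEvent.class)
                .subscribe(event -> event.getLatencies()
                        .forEach((commandId, metrics) -> System.out.println(commandId + " -> " + metrics)));
    }
}

Comparing those per-command metrics before and during a deployment would show whether the extra time is spent inside the driver or further down the chain.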

tishun · Oct 30 '24 11:10

Hi @tishun, thanks for your attention to this issue. Let me try to add more details about the issue and our suspicions.

Our suspicion is that the abnormal MGET latency increase may be related to connection problems. In our use case, we use Lettuce as the Redis client and initialize both a read connection and a write connection; see the code snippet below.

When service A starts, it creates a read connection and a write connection. The redisURIs array contains the URIs of all Redis nodes in the cluster. So in the ideal case, one instance of service A creates two connections to the Redis cluster.

// Step 1. create read connection
this.readClusterClient = RedisClusterClient.create(CLIENT_RESOURCES, redisURIs);
readClusterClient.setOptions(createClusterClientOptions());
this.readClusterConnection = readClusterClient.connect(new ByteArrayCodec());
readClusterConnection.setTimeout(readTimeoutMs, TimeUnit.MILLISECONDS);
readClusterConnection.setReadFrom(ReadFrom.NEAREST);


// Step 2. create write connection
this.writeClusterClient = RedisClusterClient.create(CLIENT_RESOURCES, redisURIs);
writeClusterClient.setOptions(createClusterClientOptions());
this.writeClusterConnection = writeClusterClient.connect(new ByteArrayCodec());
writeClusterConnection.setTimeout(writeTimeoutMs, TimeUnit.MILLISECONDS);
writeClusterConnection.setReadFrom(ReadFrom.MASTER);
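
For context, createClusterClientOptions() is not shown above. One part of the client options that is relevant to connection counts is cluster topology refresh, because the refresh opens additional connections to cluster nodes while it runs. The snippet below is purely an illustrative sketch of such options, not the actual body of our createClusterClientOptions(); the interval and triggers shown are placeholders.

import java.time.Duration;

import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;

public class TopologyRefreshSketch {

    // Illustrative only: NOT our production settings.
    public static ClusterClientOptions illustrativeOptions() {
        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofMinutes(10))  // periodic refresh connects to refresh-source nodes
                .enableAllAdaptiveRefreshTriggers()             // adaptive refresh reacts to MOVED/ASK and reconnects
                .build();

        return ClusterClientOptions.builder()
                .topologyRefreshOptions(refreshOptions)
                .build();
    }
}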

During the deployment of service A, we observe that the connected clients metric on the Redis servers is not stable, and it follows the same pattern as the MGET latency.

T1: the start time of the deployment. T2: the end time of the deployment. During the deployment, when the connected clients dropped, the MGET latency started to decrease; as the deployment progressed and the number of connected clients began to rise, the MGET latency increased correspondingly.

There are two things I want to elaborate on here.

  1. Why do the connected clients drop? -> The deployment restarts part of the service A fleet, around 60 hosts at a time, so when those instances are shut down, their connections to the Redis servers are closed.
  2. Why do the connected clients increase? -> This happens at service startup, when the instances of service A need to set up their connections again. We monitored one instance of service A (other use cases could differ): its connected clients to one Redis node changed from 4 (before deployment) -> 0 (service shutdown) -> 1 (service startup) -> 2 (service in running status). We are not sure why the count changed from 1 -> 2 instead of 2 connections being created directly at startup (see the sketch after this list).
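
As far as we understand Lettuce's behavior (please correct us if this is wrong), a StatefulRedisClusterConnection establishes its per-node channels lazily: the connection to a given node is only opened when the first command is routed to that node, and a separate read connection may be opened once ReadFrom routes a read there. That would explain seeing 1 -> 2 after startup instead of 2 connections right away. A minimal sketch of how we would verify this on a test instance (the seed host and key below are hypothetical):

import io.lettuce.core.ReadFrom;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;
import io.lettuce.core.codec.ByteArrayCodec;

public class LazyConnectionSketch {

    public static void main(String[] args) {
        // "redis-node-1" is a placeholder seed node, not one of our real hosts.
        RedisClusterClient client = RedisClusterClient.create("redis://redis-node-1:6379");
        StatefulRedisClusterConnection<byte[], byte[]> connection = client.connect(new ByteArrayCodec());
        connection.setReadFrom(ReadFrom.NEAREST);

        // At this point, CLIENT LIST on a data node should show fewer connections from this host
        // than after the first command is routed to it.
        connection.sync().get("some-key".getBytes());

        // After the first read, an additional node-specific connection should appear on the node
        // that owns the slot for "some-key".
        connection.close();
        client.shutdown();
    }
}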

I believe there is something wrong with the connections in step #1 and step #2 above. An interesting observation is that some clients establish an excessive number of connections to a single Redis server. For instance, the client with IP address 10.117.154.244 has five active connections to one Redis server. We executed ./redis-cli CLIENT LIST | grep 10.117.154.244 on the Redis node:

id=118437258 addr=10.117.154.244:36328 fd=2093 name= age=166473 idle=453 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=mget
id=124013924 addr=10.117.154.244:38930 fd=524 name= age=111 idle=111 flags=r db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=readonly
id=123908775 addr=10.117.154.244:33272 fd=925 name= age=3389 idle=3389 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=NULL
id=123912325 addr=10.117.154.244:50334 fd=44 name= age=3284 idle=3284 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=NULL
id=123991268 addr=10.117.154.244:38712 fd=1365 name= age=825 idle=825 flags=r db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=readonly

The output shows multiple connections from this client, which is concerning, as we expect a single client machine to have no more than two connections to a Redis server. The high number of connections from certain clients is likely degrading the performance of the Redis server, which in turn is increasing the MGET latency for service A.
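
One diagnostic step we plan to try, so that the name= field in the CLIENT LIST output tells us which logical connection owns each socket, is to set a distinct client name per connection via RedisURI. This is a sketch under the assumption that withClientName is honored by our Lettuce 5.3.X setup; the names below are placeholders:

import io.lettuce.core.RedisURI;

public class ClientNameSketch {

    // URIs for the read path: sockets opened through them should show name=service-a-read in CLIENT LIST.
    public static RedisURI readUri(String host, int port) {
        return RedisURI.builder()
                .withHost(host)
                .withPort(port)
                .withClientName("service-a-read")
                .build();
    }

    // URIs for the write path: sockets should show name=service-a-write.
    public static RedisURI writeUri(String host, int port) {
        return RedisURI.builder()
                .withHost(host)
                .withPort(port)
                .withClientName("service-a-write")
                .build();
    }
}

With distinct names, any socket that still shows an empty name= would stand out as one we did not open explicitly.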

We would like to get some insight into the following:

  1. In which scenarios might Lettuce create an excessive number of connections to a single Redis server?
  2. What impact does having too many connections from a client have on one Redis node? I believe more connections reduce the efficiency of executing commands.
  3. How can we prevent the situation where too many connected clients occur, e.g. limit each service A instance to 2 connections per Redis node? We are currently using Lettuce 5.3.X, and I am not sure whether there is a hidden bug in this legacy version, especially one related to connection handling.

Additionally, we have verified key metrics on the Redis server and found no anomalies:

  • CPU usage remains low (<5%).
  • Redis memory utilization is low (no replication occurred, used memory/total allocated memory < 0.3).
  • The traffic pattern for MGET remains unchanged.
  • We do not observe any network anomalies (<0.035MB per minute).

ackerL · Oct 31 '24 11:10