lettuce Periodic cluster topology refresh blocks commands execution.

Bug Report

Current Behavior

Periodic cluster topology refresh blocks commands execution.

From JFR,

topology refresh took 1.5 sec Start Time Duration End Time Event Thread sources 7/5/24, 5:49:06.714 PM 1.500 s 7/5/24, 5:49:08.215 PM lettuce-nioEventLoop-5-2 xxxx,xxxx
Thread is parked to execute get command during the topology refresh period

Start Time	Duration	End Time	Event Thread	Address of Object Parked	Class Parked On	Park Timeout	Park Until
7/5/24, 5:49:06.734 PM	1.380 s	7/5/24, 5:49:08.115 PM	io-executor-thread-9	0xE9789188	java.util.concurrent.CompletableFuture$Signaller	60.000 s	12/2/55, 4:47:04.192 PM

get command stack:
  Stack Trace	Count	Percentage
  void jdk.internal.misc.Unsafe.park(boolean, long):-1	1	100 %
  void java.util.concurrent.locks.LockSupport.parkNanos(Object, long):269	1	100 %
  boolean java.util.concurrent.CompletableFuture$Signaller.block():1866	1	100 %
  void java.util.concurrent.ForkJoinPool.unmanagedBlock(ForkJoinPool$ManagedBlocker):3780	1	100 %
  void java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool$ManagedBlocker):3725	1	100 %
  Object java.util.concurrent.CompletableFuture.timedGet(long):1939	1	100 %
  Object java.util.concurrent.CompletableFuture.get(long, TimeUnit):2095	1	100 %
  boolean io.lettuce.core.protocol.AsyncCommand.await(long, TimeUnit):83	1	100 %
  Object io.lettuce.core.internal.Futures.awaitOrCancel(RedisFuture, long, TimeUnit):244	1	100 %
  Object io.lettuce.core.cluster.ClusterFutureSyncInvocationHandler.handleInvocation(Object, Method, Object[]):131	1	100 %
  Object io.lettuce.core.internal.AbstractInvocationHandler.invoke(Object, Method, Object[]):80	1	100 %
  Object jdk.proxy2.$Proxy119.get(Object):-1	1	100 %
  byte[] oracle.oic.v3.mtms.infra.cache.impl.AbstractCacheImpl.lambda$get$8(byte[], String):280	1	100 %
  .....
  .....

Expected behavior/code

As for reference guide, from Lettuce 6.0, Cluster topology refresh is now non-blocking. But from JFR, it shows as blocking.

Environment

Lettuce version(s): 6.2.6.RELEASE
Redis version: 6.2.6

Possible Solution

Additional context

We are executing commands using synchronous api. Our redis cluster consists of 9 nodes: 3 master and each master has 2 replicas.

Jul 08 '24 14:07 ashok-shukla

Hey @ashok-shukla ,

what lead you to conclude that the parked thread is handling topology refresh? Is there some minimum reproducible example of the problem we can take a look at?

Jul 09 '24 11:07 tishun

(Asynchronous topology refresh is handled in #1107)

Jul 09 '24 11:07 tishun

parked thread is not handling topology refresh. but waiting for topology refresh.

io-executor-thread-9: is parked from 7/5/24, 5:49:06.734 PM to 7/5/24, 5:49:08.115 PM. lettuce-nioEventLoop-5-2: doing topology refresh from 7/5/24, 5:49:06.714 PM to 7/5/24, 5:49:08.215 PM.

I do not have a minimum reproducible example now. if possible, I will try to create one.

Jul 09 '24 14:07 ashok-shukla

reproducer: cache.zip

download cache.zip and unzip
update REDIS_HOST and REDIS_PORT in CheckTopologyRefresh.java as per your redis cluster env. Our redis cluster nodes running in kubernetes and we just point to redis cluster kubernetes service as redis host.
Update ssl options as per you setup.
build the project: mvn clean install
run the jar with following command: java -Xms1024m -Xmx1024m -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:GCLockerRetryAllocationCount=100 -XX:+FlightRecorder -XX:StartFlightRecording=duration=300s,filename=lettuce5min.jfr -jar cache-1.0-SNAPSHOT-jar-with-dependencies.jar 5 | tee test5min.log
compare topology refresh time in lettuce5min.jfr with time in test5min.log. Check response time during that period.

Jul 10 '24 22:07 ashok-shukla

Awesome! Much appreciated!

Jul 11 '24 08:07 tishun