Periodic cluster topology refresh blocks commands execution.
Bug Report
Current Behavior
Periodic cluster topology refresh blocks commands execution.
From JFR,
-
topology refresh took 1.5 sec Start Time Duration End Time Event Thread sources 7/5/24, 5:49:06.714 PM 1.500 s 7/5/24, 5:49:08.215 PM lettuce-nioEventLoop-5-2 xxxx,xxxx
-
Thread is parked to execute get command during the topology refresh period
Start Time Duration End Time Event Thread Address of Object Parked Class Parked On Park Timeout Park Until
7/5/24, 5:49:06.734 PM 1.380 s 7/5/24, 5:49:08.115 PM io-executor-thread-9 0xE9789188 java.util.concurrent.CompletableFuture$Signaller 60.000 s 12/2/55, 4:47:04.192 PM
get command stack:
Stack Trace Count Percentage
void jdk.internal.misc.Unsafe.park(boolean, long):-1 1 100 %
void java.util.concurrent.locks.LockSupport.parkNanos(Object, long):269 1 100 %
boolean java.util.concurrent.CompletableFuture$Signaller.block():1866 1 100 %
void java.util.concurrent.ForkJoinPool.unmanagedBlock(ForkJoinPool$ManagedBlocker):3780 1 100 %
void java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool$ManagedBlocker):3725 1 100 %
Object java.util.concurrent.CompletableFuture.timedGet(long):1939 1 100 %
Object java.util.concurrent.CompletableFuture.get(long, TimeUnit):2095 1 100 %
boolean io.lettuce.core.protocol.AsyncCommand.await(long, TimeUnit):83 1 100 %
Object io.lettuce.core.internal.Futures.awaitOrCancel(RedisFuture, long, TimeUnit):244 1 100 %
Object io.lettuce.core.cluster.ClusterFutureSyncInvocationHandler.handleInvocation(Object, Method, Object[]):131 1 100 %
Object io.lettuce.core.internal.AbstractInvocationHandler.invoke(Object, Method, Object[]):80 1 100 %
Object jdk.proxy2.$Proxy119.get(Object):-1 1 100 %
byte[] oracle.oic.v3.mtms.infra.cache.impl.AbstractCacheImpl.lambda$get$8(byte[], String):280 1 100 %
.....
.....
Expected behavior/code
As for reference guide, from Lettuce 6.0, Cluster topology refresh is now non-blocking. But from JFR, it shows as blocking.
Environment
- Lettuce version(s): 6.2.6.RELEASE
- Redis version: 6.2.6
Possible Solution
Additional context
We are executing commands using synchronous api. Our redis cluster consists of 9 nodes: 3 master and each master has 2 replicas.
Hey @ashok-shukla ,
what lead you to conclude that the parked thread is handling topology refresh? Is there some minimum reproducible example of the problem we can take a look at?
(Asynchronous topology refresh is handled in #1107)
parked thread is not handling topology refresh. but waiting for topology refresh.
io-executor-thread-9: is parked from 7/5/24, 5:49:06.734 PM to 7/5/24, 5:49:08.115 PM. lettuce-nioEventLoop-5-2: doing topology refresh from 7/5/24, 5:49:06.714 PM to 7/5/24, 5:49:08.215 PM.
I do not have a minimum reproducible example now. if possible, I will try to create one.
reproducer: cache.zip
- download cache.zip and unzip
- update REDIS_HOST and REDIS_PORT in CheckTopologyRefresh.java as per your redis cluster env. Our redis cluster nodes running in kubernetes and we just point to redis cluster kubernetes service as redis host.
- Update ssl options as per you setup.
- build the project: mvn clean install
- run the jar with following command: java -Xms1024m -Xmx1024m -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:GCLockerRetryAllocationCount=100 -XX:+FlightRecorder -XX:StartFlightRecording=duration=300s,filename=lettuce5min.jfr -jar cache-1.0-SNAPSHOT-jar-with-dependencies.jar 5 | tee test5min.log
- compare topology refresh time in lettuce5min.jfr with time in test5min.log. Check response time during that period.
Awesome! Much appreciated!