Adaptive refresh doesn't work if the topology is outdated
Bug Report
Current Behavior
redis-cli -c CLUSTER NODES
9a6e2d69493c78ed4cdf4c3125772336d175b438 192.168.151.22:6379@16379 master - 0 1690268016509 21 connected 10924-16383
687af672e2c851ec75afe55e848d197f748f5826 192.168.65.248:6379@16379 master - 0 1690268017000 16 connected 5462-10923
1b974225e60fbb0cf2312de5f9ad03ffbee315f5 192.168.13.59:6379@16379 myself,slave 687af672e2c851ec75afe55e848d197f748f5826 0 1690268017000 16 connected
39253b1933ee60417373d191561bb15be2e8de77 192.168.74.50:6379@16379 master - 0 1690268017516 14 connected 0-5461
41691e3435adc5e1ffd226b6c60ddbe3b8d4b393 192.168.86.159:6379@16379 slave 39253b1933ee60417373d191561bb15be2e8de77 0 1690268016000 14 connected
We have three master nodes, and each master node has one slave node. Our redis client is configured with adaptive refresh only; periodic refresh is not enabled.
After startup, our application connects to all master nodes as part of some initialization tasks. For some reason, the master node of the same master/slave pair terminated twice. Our redis nodes are deployed as kubernetes pods: after a master node terminates, its slave node becomes master, and a new redis pod is created and becomes the new slave.
The connection status and topology our application knows are as follows:
# redis nodes
pod1: 127.0.0.1 pod2: 127.0.0.2 (pod1 is master, pod2 is slave of pod1)
pod3: 127.0.0.3 pod4: 127.0.0.4 (pod3 is master, pod4 is slave of pod3)
pod5: 127.0.0.5 pod6: 127.0.0.6 (pod5 is master, pod6 is slave of pod5)
# After startup and some initialization tasks
# application knows all redis nodes and has the following connections
app -> 127.0.0.1:6379
app -> 127.0.0.3:6379
app -> 127.0.0.5:6379
# pod5 terminated, pod6 becomes master, and pod7 is created
# redis nodes
pod1: 127.0.0.1 pod2: 127.0.0.2 (pod1 is master, pod2 is slave of pod1)
pod3: 127.0.0.3 pod4: 127.0.0.4 (pod3 is master, pod4 is slave of pod3)
pod6: 127.0.0.6 pod7: 127.0.0.7 (pod6 is master, pod7 is slave of pod6)
# connection to pod5 is broken and triggers adaptive refresh
# application knows all redis nodes and has the following connections
app -> 127.0.0.1:6379
app -> 127.0.0.3:6379
# pod6 terminated, pod7 becomes master, and pod8 is created
# redis nodes
pod1: 127.0.0.1 pod2: 127.0.0.2 (pod1 is master, pod2 is slave of pod1)
pod3: 127.0.0.3 pod4: 127.0.0.4 (pod3 is master, pod4 is slave of pod3)
pod7: 127.0.0.7 pod8: 127.0.0.8 (pod7 is master, pod8 is slave of pod7)
# since the application is not connected to pod6, pod6's termination doesn't trigger adaptive refresh
# application maintains an outdated topology, still thinks pod6 exists and is master, and has the following connections
app -> 127.0.0.1:6379
app -> 127.0.0.3:6379
After all this, the application starts to handle traffic and triggers redis operations. The Lettuce client keeps trying to connect to pod6 and always fails. But these connection failures do not trigger adaptive refresh; the Lettuce client just keeps trying to connect to the terminated master node.
2023-07-25T06:55:33.313Z debug io.lettuce.core.RedisChannelHandler - f8a8107ff8c0d00bc8367068ae9f5dac dispatching command AsyncCommand [type=MGET, output=KeyValueListOutput [output=[], error='null'], commandType=io.lettuce.core.protocol.Command]
2023-07-25T06:55:33.313Z debug i.l.c.c.PooledClusterConnectionProvider - f8a8107ff8c0d00bc8367068ae9f5dac getConnection(READ, 11368)
2023-07-25T06:55:33.313Z debug i.l.c.c.PooledClusterConnectionProvider$DefaultClusterNodeConnectionFactory - f8a8107ff8c0d00bc8367068ae9f5dac Resolved SocketAddress 192.168.13.107:6379 using for Cluster node at 192.168.13.107:6379
2023-07-25T06:55:33.313Z debug io.lettuce.core.AbstractRedisClient - f8a8107ff8c0d00bc8367068ae9f5dac Connecting to Redis at 192.168.13.107:6379
2023-07-25T06:55:33.314Z debug i.l.core.protocol.CommandHandler - [channel=0x9843b0cc, [id: 0x69183a30] (inactive), epid=0x59, chid=0x91] channelRegistered()
2023-07-25T06:55:33.315Z debug i.l.core.protocol.CommandHandler - [channel=0x9843b0cc, /192.168.140.105:50540 -> 192.168.13.107/192.168.13.107:6379, epid=0x59, chid=0x91] write(ctx, AsyncCommand [type=HELLO, output=GenericMapOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command], promise)
2023-07-25T06:55:33.315Z debug i.l.core.protocol.CommandEncoder - [channel=0x9843b0cc, /192.168.140.105:50540 -> 192.168.13.107/192.168.13.107:6379] writing command AsyncCommand [type=HELLO, output=GenericMapOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]
2023-07-25T06:55:33.414Z error io.lettuce.core.RedisCommandTimeoutException: Command timed out after 100 millisecond(s)
at io.lettuce.core.internal.ExceptionFactory.createTimeoutException(ExceptionFactory.java:59)
at io.lettuce.core.internal.Futures.awaitOrCancel(Futures.java:246)
at io.lettuce.core.cluster.ClusterFutureSyncInvocationHandler.handleInvocation(ClusterFutureSyncInvocationHandler.java:130)
at io.lettuce.core.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:80)
at com.sun.proxy.$Proxy65.mget(Unknown Source)
at com.example.jigsaw.periodiccounter.lettuce.LettucePeriodicCounter.mGetCounts(LettucePeriodicCounter.java:68)
... 21 common frames omitted
2023-07-25T06:55:33.422Z info f8a8107ff8c0d00bc8367068ae9f5dac,eric-apigm-api-proxy-56f855566f-cdtwz,,,2023-07-25T06:55:33.312Z,110,3,401,error code 4001-110: Credential invalid,HTTP_2,GET,uesessioninfo-NBI,ue-ip-session-info,v1,eric-ees-uesessionbroker:8080,/hub/ue-ip-session-info/v1/session-report-subscriptions/baxxxddxx,,,
2023-07-25T06:55:33.459Z debug io.lettuce.core.AbstractRedisClient - Connecting to Redis at 192.168.13.107:6379, initialization: 192.168.13.107:6379
io.lettuce.core.RedisCommandTimeoutException: Connection initialization timed out. Command timed out after 100 millisecond(s)
at io.lettuce.core.internal.ExceptionFactory.createTimeoutException(ExceptionFactory.java:71)
at io.lettuce.core.protocol.RedisHandshakeHandler.lambda$channelRegistered$0(RedisHandshakeHandler.java:62)
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at io.netty.util.concurrent.PromiseTask.run(PromiseTask.java:106)
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
at io.netty.util.concurrent.DefaultEventExecutor.run(DefaultEventExecutor.java:66)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
2023-07-25T06:55:33.459Z debug io.lettuce.core.RedisChannelHandler - closeAsync()
2023-07-25T06:55:33.459Z debug i.l.core.cluster.ClusterNodeEndpoint - [unknown, epid=0x59] closeAsync()
2023-07-25T06:55:33.459Z debug i.l.core.protocol.DefaultEndpoint - [unknown, epid=0x59] closeAsync()
2023-07-25T06:55:33.459Z debug i.l.core.protocol.CommandHandler - [channel=0x9843b0cc, /192.168.140.105:50540 -> 192.168.13.107/192.168.13.107:6379, epid=0x59, chid=0x91] channelInactive()
2023-07-25T06:55:33.459Z debug i.l.core.protocol.DefaultEndpoint - [unknown, epid=0x59] deactivating endpoint handler
2023-07-25T06:55:33.459Z debug i.l.core.protocol.CommandHandler - [channel=0x9843b0cc, /192.168.140.105:50540 -> 192.168.13.107/192.168.13.107:6379, epid=0x59, chid=0x91] channelInactive() done
2023-07-25T06:55:33.459Z debug i.l.core.protocol.ConnectionWatchdog - [unknown, last known addr=null] channelInactive()
2023-07-25T06:55:33.459Z debug i.l.core.protocol.ConnectionWatchdog - [unknown, last known addr=null] ConnectionWatchdog not armed
2023-07-25T06:55:33.459Z debug i.l.core.protocol.CommandHandler - [channel=0x9843b0cc, /192.168.140.105:50540 -> 192.168.13.107/192.168.13.107:6379, epid=0x59, chid=0x91] channelUnregistered()
2023-07-25T06:55:34.466Z debug io.lettuce.core.RedisChannelHandler - 4290ae676239e8388ca8a66c6a675919 dispatching command AsyncCommand [type=MGET, output=KeyValueListOutput [output=[], error='null'], commandType=io.lettuce.core.protocol.Command]
# stack trace repeats here; lettuce keeps trying to connect to the terminated master node
2023-07-25T06:55:34.466Z debug i.l.c.c.PooledClusterConnectionProvider - 4290ae676239e8388ca8a66c6a675919 getConnection(READ, 11368)
Input Code
ClusterTopologyRefreshOptions topologyRefreshOptions =
    ClusterTopologyRefreshOptions.builder()
        .enableAllAdaptiveRefreshTriggers()
        .adaptiveRefreshTriggersTimeout(Duration.ofSeconds(30))
        .build();
redisClient.setOptions(
    ClusterClientOptions.builder().topologyRefreshOptions(topologyRefreshOptions).build());
Expected behavior/code
The Lettuce client should be able to recover from this situation. Even if it maintains an outdated topology, the connection failures should trigger an adaptive refresh and load the new topology.
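For reference, adaptive refresh already defines a trigger for repeated reconnect failures (PERSISTENT_RECONNECTS). A minimal sketch of enabling just that trigger, reusing the redisClient from the input code above (the threshold and timeout values are illustrative, not our production settings):

// Sketch only: explicitly select the reconnect-related adaptive refresh trigger.
ClusterTopologyRefreshOptions reconnectTriggerOptions =
    ClusterTopologyRefreshOptions.builder()
        .enableAdaptiveRefreshTrigger(
            ClusterTopologyRefreshOptions.RefreshTrigger.PERSISTENT_RECONNECTS)
        .refreshTriggersReconnectAttempts(3) // refresh after 3 failed reconnect attempts
        .adaptiveRefreshTriggersTimeout(Duration.ofSeconds(30))
        .build();
redisClient.setOptions(
    ClusterClientOptions.builder().topologyRefreshOptions(reconnectTriggerOptions).build());

In the scenario above, however, these reconnect attempts are never scheduled because the watchdog is not armed for a channel that never became active (see Possible Solution below).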
Environment
- Lettuce version(s): 6.1.10
- Redis version: 6.2.12
Possible Solution
I checked the code: the connection watchdog only schedules a reconnect when it is armed, and the armed status is only set once the channel becomes active. What is this limitation for? Is it possible to remove the arm check?
Additional context
I'm not quite sure what you're asking for. Specifically, I can help you with this part:
I checked the code: the connection watchdog only schedules a reconnect when it is armed, and the armed status is only set once the channel becomes active. What is this limitation for? Is it possible to remove the arm check?
ConnectionWatchdog attempts a reconnect if the logical connection is open. If the connection has been closed by intent, there's no point in reconnecting because the connection should be closed.
Looking at your input code,
ClusterTopologyRefreshOptions.builder()
.enablePeriodicRefresh(Duration.ofSeconds(redisConfig.getPeriodicRefreshSeconds()))
Our redis client is configured with adaptive refresh only; periodic refresh is not enabled.
isn't true, as periodic refresh is enabled.
We fixed the problem by enabling periodic refresh. Sorry, I pasted the wrong code; I have already removed it.
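For completeness, a minimal sketch of combining periodic refresh with the adaptive triggers (the interval and timeout shown are illustrative, not our production values):

// Sketch only: periodic refresh acts as a safety net when no adaptive trigger fires.
ClusterTopologyRefreshOptions refreshOptions =
    ClusterTopologyRefreshOptions.builder()
        .enablePeriodicRefresh(Duration.ofSeconds(60)) // illustrative interval
        .enableAllAdaptiveRefreshTriggers()
        .adaptiveRefreshTriggersTimeout(Duration.ofSeconds(30))
        .build();
redisClient.setOptions(
    ClusterClientOptions.builder().topologyRefreshOptions(refreshOptions).build());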
For some unknown reason, the master node of our redis cluster terminated twice consecutively. After the cluster recovered, we ran into a situation where the lettuce client maintained an outdated topology.
ConnectionWatchdog attempts a reconnect if the logical connection is open. If the connection has been closed by intent, there's no point in reconnecting because the connection should be closed.
In this case, the connection is not closed by intent: the channel never becomes active, it tries to connect to an IP that no longer exists and fails, which then triggers channelInactive. But ConnectionWatchdog is only armed once the channel becomes active, so this will not trigger adaptive refresh.
If lettuce tries to create a channel but is unable to connect, that should also be a valid case for triggering adaptive refresh.
I'm not sure how the armed flag helps with closing by intent. If a channel is closed by intent, it should have become active before it was closed, right?
In this case, the channel never becomes active.
The Lettuce client keeps trying to connect to pod6 and always fails. But these connection failures do not trigger adaptive refresh; the Lettuce client just keeps trying to connect to the terminated master node.
I am confused. If the driver is unable to connect, how is it supposed to refresh the topology? Which seed nodes do you provide?
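For context on the seed-node question: seed nodes are the initial RedisURIs the cluster client is created with. A minimal sketch with placeholder hostnames (not taken from this thread):

// Sketch only: several seed nodes, so at least one remains reachable for topology refresh.
List<RedisURI> seedNodes = Arrays.asList(
    RedisURI.create("redis://redis-node-0:6379"),
    RedisURI.create("redis://redis-node-1:6379"),
    RedisURI.create("redis://redis-node-2:6379"));
RedisClusterClient clusterClient = RedisClusterClient.create(seedNodes);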
If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 2 weeks this issue will be closed.
I'm not sure if this is quite apples to apples, but in my case the configuration is REPLICA_PREFERRED. The application starts up and all is good, but at runtime, if the slaves are restarted, it does not attempt to reconfigure the slave IP addresses from the provided kube-dns master hostname to self-heal after the non-static redis slave pod IPs are reassigned. I haven't fully explored this, but I have noticed the bootstrap configuration being updated from the hostname it initially had to a raw IP.
Hey @Freydal - no matter if this is the same issue or not, if you can help analyse the problem (with some example, steps, whatever you can to help us pinpoint the possible issue) it would be highly appreciated!
Sure.
I'm not sure of the full scope of configuration details that would be needed to replicate the issue, but I will try to set up something reproducible.
In a nonprod cluster there was an upgrade or other event that caused complete node recreation. These applications use a common internal spring cache autoconfigure jar that creates a LettuceClientConfiguration (spring-data-redis 3.0.1) from its builder, setting only readFrom to REPLICA_PREFERRED. Rescheduled application pods were fine until the Redis slave pods had been rescheduled to a new node. The watchdog began complaining after the pods they were configured to reconnect to were no longer available. This failure didn't provide an opportunity to abandon the deleted pods and attempt rediscovery.
This is easily reproducible by restarting Redis slave pods.
Rescheduled application pods were fine until the Redis slave pods had been rescheduled to a new node. The watchdog began complaining after the pods they were configured to reconnect to were no longer available. This failure didn't provide an opportunity to abandon the deleted pods and attempt rediscovery.
This is easily reproducible by restarting Redis slave pods.
Is adaptive refresh enabled on this environment? Can you share a code snippet?
I can't directly. I'm not sure at what depth you are interested in. The ReactiveRedisConnectionFactory bean is:
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.ReactiveRedisConnectionFactory;
import org.springframework.data.redis.connection.RedisStandaloneConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.data.redis.connection.lettuce.LettuceClientConfiguration;

import static io.lettuce.core.ReadFrom.REPLICA_PREFERRED;

@Configuration
public class RedisConfig {

    @Bean
    public ReactiveRedisConnectionFactory reactiveRedisConnectionFactory(
            String redisHost,
            int redisPort,
            String redisPassword) {
        // Prefer reading from replicas; fall back to the master when no replica is available.
        LettuceClientConfiguration clientConfig =
                LettuceClientConfiguration.builder().readFrom(REPLICA_PREFERRED).build();
        RedisStandaloneConfiguration redisConfig = new RedisStandaloneConfiguration(redisHost, redisPort);
        redisConfig.setPassword(redisPassword);
        return new LettuceConnectionFactory(redisConfig, clientConfig);
    }
}
Hey @Freydal sorry for the long delay.
I really got confused by your reply. You are setting up REPLICA_PREFERRED, but you are using RedisStandaloneConfiguration. RedisStandaloneConfiguration creates a connection to a single node, so it would most probably ignore the ReadFrom setting, which only makes sense if you are using a clustered/sentinel connection.
Am I reading it wrong?
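For comparison, a minimal sketch of a setup where readFrom(REPLICA_PREFERRED) actually applies, using a cluster configuration with topology refresh enabled. The class name, hostnames, and refresh values here are placeholders for illustration, not taken from this thread:

import java.time.Duration;
import java.util.Arrays;

import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.ReactiveRedisConnectionFactory;
import org.springframework.data.redis.connection.RedisClusterConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceClientConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;

import static io.lettuce.core.ReadFrom.REPLICA_PREFERRED;

@Configuration
public class RedisClusterConfig {

    @Bean
    public ReactiveRedisConnectionFactory reactiveRedisConnectionFactory() {
        // Refresh the cluster topology periodically and on adaptive triggers (placeholder values).
        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(60))
                .enableAllAdaptiveRefreshTriggers()
                .build();
        // REPLICA_PREFERRED only takes effect for cluster/sentinel connections.
        LettuceClientConfiguration clientConfig = LettuceClientConfiguration.builder()
                .readFrom(REPLICA_PREFERRED)
                .clientOptions(ClusterClientOptions.builder()
                        .topologyRefreshOptions(refreshOptions)
                        .build())
                .build();
        // Seed nodes are placeholders.
        RedisClusterConfiguration clusterConfig =
                new RedisClusterConfiguration(Arrays.asList("redis-node-0:6379", "redis-node-1:6379"));
        return new LettuceConnectionFactory(clusterConfig, clientConfig);
    }
}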
Technically it wasn't me. I dug into this after the failure to support a team that consumes this from an internal library jar. I can tell you it does connect to the slave nodes. When they restart, it doesn't repair the connections and just spams connection-failure logging. This could perhaps be resolved by proper configuration, but I haven't continued to dig into it this time.
Oh, I see. Makes sense.
This could perhaps be resolved by proper configuration, but I haven't continued to dig into it this time.
If you (or anybody else from the team) get to that and have trouble, please open up a ticket. For now it seems this is more of a configuration issue than a bug, so I can't do much more to help.
If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 2 weeks this issue will be closed.