PubSubConnection recurring disconnections after Redis cluster is downscaled
Bug Report
Current Behavior
Given a setup where we have a StatefulRedisClusterPubSubConnection on a particular Redis node in a Redis cluster. When we remove that Redis node from the cluster, after the connection has been reconnected to another Redis node (by ConnectionWatchdog), lettuce keeps killing this connection periodically. This occurs indefinitely from then on. The periodicity of this connection drop matches whatever duration I set for periodic refresh of ClusterTopologyRefresh (25 seconds in my code snippet below).
Logs of the application - Notice the subscribe connection getting dropped every 25 seconds (which matches ClusterTopologyRefresh periodic refresh duration) ->
Jul 01, 2023 7:23:26 PM org.personal.redis.LettuceClusterReconnectIssueTestV2 thread_sleep
INFO: Sleeping for 500 seconds
Jul 01, 2023 7:23:54 PM org.personal.redis.LettuceClusterReconnectIssueTestV2 lambda$main$1
INFO: Event : ReconnectAttemptEvent [local:any -> /127.0.0.1:30011]
2023-07-01 19:23:54.871 INFO - Reconnecting, last destination was /127.0.0.1:30011
2023-07-01 19:23:54.882 INFO - Reconnected to 127.0.0.1:30013
2023-07-01 19:24:16.892 WARN - Unable to connect to [127.0.0.1:30011]: Connection refused: /127.0.0.1:30011
Jul 01, 2023 7:24:16 PM org.personal.redis.LettuceClusterReconnectIssueTestV2 lambda$main$1
INFO: Event : ClusterTopologyChangedEvent [before=3, after=2]
2023-07-01 19:24:16.961 INFO - Reconnecting, last destination was /127.0.0.1:30013
2023-07-01 19:24:16.967 INFO - Reconnected to 127.0.0.1:30012
Jul 01, 2023 7:24:16 PM org.personal.redis.LettuceClusterReconnectIssueTestV2 lambda$main$1
INFO: Event : ReconnectAttemptEvent [local:any -> /127.0.0.1:30013]
2023-07-01 19:24:41.960 INFO - Reconnecting, last destination was /127.0.0.1:30012
2023-07-01 19:24:41.969 INFO - Reconnected to 127.0.0.1:30013
Jul 01, 2023 7:24:41 PM org.personal.redis.LettuceClusterReconnectIssueTestV2 lambda$main$1
INFO: Event : ReconnectAttemptEvent [local:any -> /127.0.0.1:30012]
2023-07-01 19:25:06.960 INFO - Reconnecting, last destination was /127.0.0.1:30013
2023-07-01 19:25:06.965 INFO - Reconnected to 127.0.0.1:30012
Jul 01, 2023 7:25:06 PM org.personal.redis.LettuceClusterReconnectIssueTestV2 lambda$main$1
INFO: Event : ReconnectAttemptEvent [local:any -> /127.0.0.1:30013]
To Reproduce the issue
Setup Redis version - 7.0.7 Lettuce version - 6.2.4.RELEASE
Have a local Redis cluster with 3 master nodes (assuming node IPs 30011, 30012, 30013)
Simple application code to reproduce the issue. While the application is running, identify the redis node on which the subscribe connection is made (assuming it is 30011), get it's cluster ID (assume it's 'dcb98b27b9d7b576b8552d23a7ad87ae77f7f0c9') and run the below commands on the local Redis cluster (while the application is running)->
redis-cli -p 30011 shutdown nosave
redis-cli -p 30012 -c cluster forget dcb98b27b9d7b576b8552d23a7ad87ae77f7f0c9
redis-cli -p 30013 -c cluster forget dcb98b27b9d7b576b8552d23a7ad87ae77f7f0c9
Simple application code to reproduce the issue - (notice that periodic refresh of ClusterTopologyRefresh is kept at 25 seconds) ->
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.pubsub.StatefulRedisClusterPubSubConnection;
import java.time.Duration;
import java.util.logging.Logger;
import reactor.core.Disposable;
public class LettuceClusterReconnectIssueTestV2 {
static Logger LOG = Logger.getLogger(LettuceClusterReconnectIssueTestV2.class.getName());
public static void main(String[] args) {
RedisClusterClient redisClient = RedisClusterClient.create("redis://localhost:30012");
redisClient.setOptions(getClientOptions(getTopologyRefreshOptions()));
StatefulRedisClusterPubSubConnection<String, String> subscribeConnection = redisClient.connectPubSub();
Disposable eventBusSubscription = redisClient.getResources().eventBus().get().subscribe(event -> {
LOG.info(String.format("Event : %s", event));
});
subscribeConnection.sync().subscribe("testSubscribeChannel");
thread_sleep(500_000);
// cleanup
subscribeConnection.close();
eventBusSubscription.dispose();
redisClient.shutdown();
}
public static void thread_sleep(int sleepTimeInMillis) {
try {
LOG.info(String.format("Sleeping for %s seconds", sleepTimeInMillis/1000));
Thread.sleep(sleepTimeInMillis);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
private static ClusterClientOptions getClientOptions(
final ClusterTopologyRefreshOptions topologyRefreshOptions) {
return ClusterClientOptions.builder()
.topologyRefreshOptions(topologyRefreshOptions)
.validateClusterNodeMembership(false)
.build();
}
private static ClusterTopologyRefreshOptions getTopologyRefreshOptions() {
final ClusterTopologyRefreshOptions.Builder topologyRefreshOptionsBuilder =
ClusterTopologyRefreshOptions.builder()
.enableAllAdaptiveRefreshTriggers()
.enablePeriodicRefresh(Duration.ofSeconds(25));
return topologyRefreshOptionsBuilder.build();
}
}
Expected behavior/code
The subscribe connection should not get killed (and recreated) periodically after we have downscaled the Redis cluster.
Environment
- Redis version - 7.0.7
- Lettuce version - 6.2.4.RELEASE
Have a local Redis cluster with 3 master nodes (assuming node IPs 30011, 30012, 30013)
I debugged further and found the source of the issue - In StatefulRedisClusterPubSubConnectionImpl.activated() method - async.clusterMyId() fails in all subsequent reconnections except the first time we create the connection with the error 'Command CLUSTER not allowed while subscribed. Allowed commands are: [PSUBSCRIBE, QUIT, PUNSUBSCRIBE, SUBSCRIBE, UNSUBSCRIBE, PING]' (due to https://github.com/lettuce-io/lettuce-core/blob/main/src/main/java/io/lettuce/core/pubsub/PubSubEndpoint.java#L131). This succeeds when the connection is first created as there were no subscriptions at that point. What this implies is that the nodeId of the StatefulRedisClusterPubSubConnection never gets changed whenever reconnection occurs for this connection.
This doesn't cause any problems except that when that particular nodeId is not part of the cluster anymore - due to https://github.com/lettuce-io/lettuce-core/blob/main/src/main/java/io/lettuce/core/cluster/StatefulRedisClusterPubSubConnectionImpl.java#L189. If the nodeId is not part of the cluster, then, ClusterTopologyRefresh (which triggers StatefulRedisClusterPubSubConnectionImpl.setPartitions) will disconnect this connection everytime it gets triggered.
This looks like a genuine bug to me but I am unsure as to how nobody else has encountered it till now as redis downscaling should be a normal scenario for anyone. @mp911de Could you please tell me if this is indeed a genuine bug, so I can work towards other workarounds to fix this for now for my application.
The rejection of Command CLUSTER is indeed a bug because the command is issued after resubscribing. I think that we can skip this call on activation if the nodeId is already set. Additionally, resubscribe already obtains the node id and we can reuse the identifier from there.
You can disable stale connection closing via ClusterClientOptions.CloseStaleConnections, however, at some point, the topology should be recent and you should not see disconnects anymore.
Thanks for the quick reply. Without disabling the stale connections, the disconnects continue indefinitely (it continued for at least 4 days in my own application). IMO this is because nodeId of the StatefulRedisClusterPubSubConnection will never change even though the topology is refreshed. For now, do we have any other option to handle this besides disabling closeStaleConnections as it is not recommended by lettuce docs ? We were considering disabling the auto-reconnect and handle re-creation of connection from our side.
@egoissst @mp911de Hello, I'm encountering the same issue. If you have solved this, could you please share how you did it? Thank you.
@benny1020 We didn't find any solution. So, we ended up disabling the auto-reconnect and handled the re-creation of pub-sub connections from application side.