Clients unable to recover from a cluster failover when connected to multiple Redis clusters

Open umang92 opened this issue 3 years ago • 1 comment

Bug Report

Current Behavior

We have the following setup. The client code creates two RedisClusterClient objects so that it can connect to two different Redis Clusters at the same time.

We are using 2 AWS ElastiCache Redis clusters (cluster mode enabled) with 6 shards and 1 replica node per shard.

We then initiate a failover on one shard of one of the two Redis Clusters. As soon as the failover starts, the client application begins receiving RedisCommandTimeoutException, which is expected. However, the client never recovers: the exceptions keep occurring in large numbers even 15-20 minutes later. The system recovers only after the client process is restarted.

We tested the exact same scenario with the client application connected to a single Redis Cluster. In that case, the client recovers from the RedisCommandTimeoutExceptions within 1 minute of the failover being initiated.

We use the following Redis commands in our setup: rpush, lpop, and pexpire.

The client code is in the Input Code section. Note that no client code changes were made when moving from a single Redis Cluster to multiple Redis Clusters.

Input Code

    // Startup code
    RedisURI.Builder builder = RedisURI.Builder.redis(redisDetails.getHost(), redisDetails.getPort());

    logger.info("Enable SSL");
    builder = builder.withSsl(Boolean.TRUE);
    builder = builder.withPassword(properties.getRedisPassword());

    RedisURI redisURI = builder.withTimeout(Duration.ofSeconds(5)).build();
    redisClusterClient = RedisClusterClient.create(redisURI);
    if (redisClusterClient == null) {
        logger.info("Could not create Redis connection.");
        throw new Exception("Could not create redis connection.");
    } else {
        logger.info("Redis connection created successfully.");
    }

    // Periodic refresh interval is set to 15 seconds
    ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
            .enablePeriodicRefresh(Duration.ofSeconds(properties.getRedisTopologyRefreshInterval()))
            .build();

    redisClusterClient.setOptions(ClusterClientOptions.builder()
            .topologyRefreshOptions(topologyRefreshOptions)
            .build());

    GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();
    poolConfig.setMaxTotal(properties.getMaxRedisConnections());
    poolConfig.setMaxIdle(properties.getMaxRedisConnections());
    poolConfig.setMinIdle(properties.getMinRedisConnections());

    GenericObjectPool<StatefulRedisClusterConnection<String, byte[]>> pool = ConnectionPoolSupport
            .createGenericObjectPool(() -> redisClusterClient.connect(new StreamRedisCodec()), poolConfig);

    // Per-request code
    try {
        StatefulRedisClusterConnection<String, byte[]> connection = pool.borrowObject();
        connection.sync().rpush(......)
    } catch (Exception ex) {
        throw new IOException(ex);
    }

Expected behavior/code

Environment

  • Lettuce version: 5.3.0.RELEASE
  • Redis version: AWS ElastiCache Redis 5.0.6

Possible Solution

Additional context

umang92 · Aug 25 '20 18:08

Each RedisClusterClient has its own topology refresh and its own set of connections that it manages. You'd probably need to enable debug logs to trace topology updates and check whether the topology state of each client reflects the most recent changes.

mp911de · Sep 04 '20 09:09
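
For anyone following the suggestion to enable debug logs, here is a minimal sketch of one way to do it from code. It is not taken from the thread and rests on several assumptions: that no SLF4J or Log4j binding is on the classpath, so Lettuce's logging (which goes through Netty's logger factory) falls back to java.util.logging, where debug output corresponds to Level.FINE; and that the loggers of interest sit under the io.lettuce.core.cluster package. The class name LettuceDebugLogging is hypothetical. If the application uses Logback or Log4j, the equivalent change would be a DEBUG level for the same logger name in that framework's configuration.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of Lettuce.
final class LettuceDebugLogging {

    // java.util.logging holds loggers weakly; keep a strong reference
    // so the level and handler set below are not lost to garbage collection.
    private static Logger clusterLogger;

    private LettuceDebugLogging() {
    }

    // Raises the log level for Lettuce's cluster package to FINE (debug).
    // Assumption: Lettuce is logging through java.util.logging.
    static synchronized Logger enableClusterDebugLogging() {
        if (clusterLogger != null) {
            return clusterLogger; // already configured; avoid adding a second handler
        }
        Logger logger = Logger.getLogger("io.lettuce.core.cluster");
        logger.setLevel(Level.FINE);

        // The default console handler only prints INFO and above,
        // so attach a dedicated handler that lets FINE records through.
        Handler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);
        logger.addHandler(handler);

        // Do not also forward records to the root handlers (avoids duplicate lines).
        logger.setUseParentHandlers(false);

        clusterLogger = logger;
        return logger;
    }
}
```

Calling LettuceDebugLogging.enableClusterDebugLogging() once at startup, before the two RedisClusterClient objects are created, should make topology refresh activity for both clients visible on stderr, so the output of the client that recovers can be compared against the one that does not.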