ioredis Long Recovery Time After Failover with AWS Multi-AZ Clusters

Long Recovery Time After Failover with AWS Multi-AZ Clusters

Open JagandeepBrar opened this issue 3 years ago • 0 comments

We are attempting to enable multi-AZ support on our AWS clusters but we are experiencing an unexpectedly high recovery time after a failover event.

We first approached AWS for support but they have shown us that this is not normal behaviour and only occurs when utilizing ioredis. When utilizing the Python library redis-py-cluster we experience a recovery time that is much more in-line with the expectations of enabling multi-AZ on our clusters.

Library	Multi-AZ Enabled?	Recovery Time
`iosredis`	No	102s
`redis-py-cluster`	No	57s
`iosredis`	Yes	338s
`redis-py-cluster`	Yes	56s

We are experiencing an increase of over 300% in recovery time when utilizing ioredis and is over 6x the recovery time of redis-py-cluster.

Here is the Gist of the two tests executed by the AWS agent, and below is the full correspondence with them.

Please let me know if there is any additional information that can help with resolving the issue, thanks.

In order to test this I reproduced this scenario in the test environment using the following setup :
 
Client library : ioredis
 
$ npm list
 
ec2-user@ /home/ec2-user
└── [email protected]
 
Created 2 Cluster mode enabled clusters with below configuration :
 
test-redis-multi-az
===============
 
Engine version : 4.0.10
Node type : t3.small
multiAZEnabled : true
clusterEnabled": true
 
test-redis-multi-az-disabled
=======================
 
Engine version : 4.0.10
Node type : t3.small
multiAZEnabled : false
clusterEnabled": true
 
Below is the sample io redis program I used : test.js
 
var Redis = require("ioredis");
var cluster = new Redis.Cluster([{
  port: 6379,
  host: "<configuration-endpoint>",
  maxRedirections : 16,
  retryDelayOnFailover : 1000,
}]);
 
cluster.incr("key1");
cluster.get("key1", function(err, res) {
  console.log ('GET RESULT ==>' + res);
  process.exit();
});
 
The test.js code basically increments the key1's value to test write and then read the key1 to test Read operation. I used cluster configuration endpoint during the tests. Then I run while loop to call test.js during the Manual Test Failover for the ElastiCache Redis Cluster
 
Execute test.js -> while true ; do date ; node test.js ; sleep 1; done
 
Below are the test results :

IO REDIS -> Failover test
========
 
test-redis-multi-az -> Time taken to recover -> 5 minutes and 38 seconds
 
Sat Nov  6 10:02:26 UTC 2021
[ioredis] Unhandled error event: ClusterAllFailedError: Failed to refresh slots cache.
    at tryNode (/home/ec2-user/node_modules/ioredis/built/cluster/index.js:396:31)
    at /home/ec2-user/node_modules/ioredis/built/cluster/index.js:413:21
    at /home/ec2-user/node_modules/ioredis/built/cluster/index.js:671:24
    at run (/home/ec2-user/node_modules/ioredis/built/utils/index.js:157:22)
    at tryCatcher (/home/ec2-user/node_modules/standard-as-callback/built/utils.js:12:23)
    at /home/ec2-user/node_modules/standard-as-callback/built/index.js:33:51
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
Sat Nov  6 10:08:04 UTC 2021

---

test-redis-multi-az-disabled -> Time taken to recover -> 1 minute and 42 seconds
 
Sat Nov  6 10:15:20 UTC 2021
[ioredis] Unhandled error event: ClusterAllFailedError: Failed to refresh slots cache.
    at tryNode (/home/ec2-user/node_modules/ioredis/built/cluster/index.js:396:31)
    at /home/ec2-user/node_modules/ioredis/built/cluster/index.js:413:21
    at /home/ec2-user/node_modules/ioredis/built/cluster/index.js:671:24
    at run (/home/ec2-user/node_modules/ioredis/built/utils/index.js:157:22)
    at tryCatcher (/home/ec2-user/node_modules/standard-as-callback/built/utils.js:12:23)
    at /home/ec2-user/node_modules/standard-as-callback/built/index.js:33:51
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
Sat Nov  6 10:17:02 UTC 2021
 
The above test concurs with your finding that Cluster mode enabled clusters with multi AZ enabled is taking longer time to recover when we are doing test failover operation however in order to isolate if the issue is specific to IOREDIS client library I tested the same with python library -> RedisCluster however to my surprise the results where completely different.
 
Below are the results using python library RedisCluster
 
RedisCluster
=============
 
test-redis-multi-az -> Failover test -> Time taken to recover 0 minutes and 56 seconds
 
021-11-06 10:28:19,729 [setThread] [INFO   ]  Setting mykey with value "Nov 06 2021 10:28:19"
2021-11-06 10:28:19,729 [getThread] [INFO   ]  got mykey value: Nov 06 2021 10:28:19
2021-11-06 10:28:19,830 [getThread] [ERROR  ]  ConnectionError
Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.7/site-packages/rediscluster/client.py", line 631, in
    raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.
 
2021-11-06 10:29:15,350 [setThread] [INFO   ]  Setting mykey with value "Nov 06 2021 10:29:15"
 
test-redis-multi-az-disabled -> Failover test ->  Time taken to recover 0 minutes and 57 seconds
 
2021-11-06 10:37:06,191 [setThread] [INFO   ]  Setting mykey with value "Nov 06 2021 10:37:06"
2021-11-06 10:37:06,192 [getThread] [INFO   ]  got mykey value: Nov 06 2021 10:37:06
2021-11-06 10:37:06,292 [getThread] [ERROR  ]  ConnectionError
Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.7/site-packages/rediscluster/client.py", line 631, in _execute_command
    return self.parse_response(connection, command, **kwargs)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/redis/client.py", line 915, in parse_response
    response = connection.read_response()
  File "/home/ec2-user/.local/lib/python3.7/site-packages/redis/connection.py", line 739, in read_response
    response = self._parser.read_response()
  File "/home/ec2-user/.local/lib/python3.7/site-packages/redis/connection.py", line 324, in read_response
    raw = self._buffer.readline()
  File "/home/ec2-user/.local/lib/python3.7/site-packages/redis/connection.py", line 256, in readline
    self._read_from_socket()
  File "/home/ec2-user/.local/lib/python3.7/site-packages/redis/connection.py", line 201, in _read_from_socket
    raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.
 
rediscluster.exceptions.ClusterDownError: The cluster is down
2021-11-06 10:38:03,137 [getThread] [INFO   ]  got mykey value: Nov 06 2021 10:38:03
 
 
Please note I have shared both ioredis program and python program which i used for testing in the below case correspondence.
 
In my python progam :
 
redis = RedisCluster(startup_nodes=[{"host": "<configuration-endpoint>","port": "6379"}], decode_responses=True,skip_full_coverage_check=True,health_check_interval=2, socket_timeout=3)
 
I have added -> health_check_interval , socket_timeout
 
The results were significantly better when I added health checks and a short timeout (2 seconds) to my connection object before which I expirenced higher downtime as the library keeps trying to submit the requests indefinitely to the current connection. I would recommend checking if IO redis has some similar functions as I do not have expirence working with io redis.

Feb 17 '22 20:02 JagandeepBrar

ioredis ioredis copied to clipboard

Long Recovery Time After Failover with AWS Multi-AZ Clusters

ioredis
ioredis copied to clipboard