ioredis
ioredis copied to clipboard
Long Recovery Time After Failover with AWS Multi-AZ Clusters
We are attempting to enable multi-AZ support on our AWS clusters but we are experiencing an unexpectedly high recovery time after a failover event.
We first approached AWS for support but they have shown us that this is not normal behaviour and only occurs when utilizing ioredis
. When utilizing the Python library redis-py-cluster
we experience a recovery time that is much more in-line with the expectations of enabling multi-AZ on our clusters.
Library | Multi-AZ Enabled? | Recovery Time |
---|---|---|
iosredis |
No | 102s |
redis-py-cluster |
No | 57s |
iosredis |
Yes | 338s |
redis-py-cluster |
Yes | 56s |
We are experiencing an increase of over 300% in recovery time when utilizing ioredis
and is over 6x the recovery time of redis-py-cluster
.
Here is the Gist of the two tests executed by the AWS agent, and below is the full correspondence with them.
Please let me know if there is any additional information that can help with resolving the issue, thanks.
In order to test this I reproduced this scenario in the test environment using the following setup :
Client library : ioredis
$ npm list
ec2-user@ /home/ec2-user
└── [email protected]
Created 2 Cluster mode enabled clusters with below configuration :
test-redis-multi-az
===============
Engine version : 4.0.10
Node type : t3.small
multiAZEnabled : true
clusterEnabled": true
test-redis-multi-az-disabled
=======================
Engine version : 4.0.10
Node type : t3.small
multiAZEnabled : false
clusterEnabled": true
Below is the sample io redis program I used : test.js
var Redis = require("ioredis");
var cluster = new Redis.Cluster([{
port: 6379,
host: "<configuration-endpoint>",
maxRedirections : 16,
retryDelayOnFailover : 1000,
}]);
cluster.incr("key1");
cluster.get("key1", function(err, res) {
console.log ('GET RESULT ==>' + res);
process.exit();
});
The test.js code basically increments the key1's value to test write and then read the key1 to test Read operation. I used cluster configuration endpoint during the tests. Then I run while loop to call test.js during the Manual Test Failover for the ElastiCache Redis Cluster
Execute test.js -> while true ; do date ; node test.js ; sleep 1; done
Below are the test results :
IO REDIS -> Failover test
========
test-redis-multi-az -> Time taken to recover -> 5 minutes and 38 seconds
Sat Nov 6 10:02:26 UTC 2021
[ioredis] Unhandled error event: ClusterAllFailedError: Failed to refresh slots cache.
at tryNode (/home/ec2-user/node_modules/ioredis/built/cluster/index.js:396:31)
at /home/ec2-user/node_modules/ioredis/built/cluster/index.js:413:21
at /home/ec2-user/node_modules/ioredis/built/cluster/index.js:671:24
at run (/home/ec2-user/node_modules/ioredis/built/utils/index.js:157:22)
at tryCatcher (/home/ec2-user/node_modules/standard-as-callback/built/utils.js:12:23)
at /home/ec2-user/node_modules/standard-as-callback/built/index.js:33:51
at processTicksAndRejections (node:internal/process/task_queues:96:5)
Sat Nov 6 10:08:04 UTC 2021
---
test-redis-multi-az-disabled -> Time taken to recover -> 1 minute and 42 seconds
Sat Nov 6 10:15:20 UTC 2021
[ioredis] Unhandled error event: ClusterAllFailedError: Failed to refresh slots cache.
at tryNode (/home/ec2-user/node_modules/ioredis/built/cluster/index.js:396:31)
at /home/ec2-user/node_modules/ioredis/built/cluster/index.js:413:21
at /home/ec2-user/node_modules/ioredis/built/cluster/index.js:671:24
at run (/home/ec2-user/node_modules/ioredis/built/utils/index.js:157:22)
at tryCatcher (/home/ec2-user/node_modules/standard-as-callback/built/utils.js:12:23)
at /home/ec2-user/node_modules/standard-as-callback/built/index.js:33:51
at processTicksAndRejections (node:internal/process/task_queues:96:5)
Sat Nov 6 10:17:02 UTC 2021
The above test concurs with your finding that Cluster mode enabled clusters with multi AZ enabled is taking longer time to recover when we are doing test failover operation however in order to isolate if the issue is specific to IOREDIS client library I tested the same with python library -> RedisCluster however to my surprise the results where completely different.
Below are the results using python library RedisCluster
RedisCluster
=============
test-redis-multi-az -> Failover test -> Time taken to recover 0 minutes and 56 seconds
021-11-06 10:28:19,729 [setThread] [INFO ] Setting mykey with value "Nov 06 2021 10:28:19"
2021-11-06 10:28:19,729 [getThread] [INFO ] got mykey value: Nov 06 2021 10:28:19
2021-11-06 10:28:19,830 [getThread] [ERROR ] ConnectionError
Traceback (most recent call last):
File "/home/ec2-user/.local/lib/python3.7/site-packages/rediscluster/client.py", line 631, in
raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.
2021-11-06 10:29:15,350 [setThread] [INFO ] Setting mykey with value "Nov 06 2021 10:29:15"
test-redis-multi-az-disabled -> Failover test -> Time taken to recover 0 minutes and 57 seconds
2021-11-06 10:37:06,191 [setThread] [INFO ] Setting mykey with value "Nov 06 2021 10:37:06"
2021-11-06 10:37:06,192 [getThread] [INFO ] got mykey value: Nov 06 2021 10:37:06
2021-11-06 10:37:06,292 [getThread] [ERROR ] ConnectionError
Traceback (most recent call last):
File "/home/ec2-user/.local/lib/python3.7/site-packages/rediscluster/client.py", line 631, in _execute_command
return self.parse_response(connection, command, **kwargs)
File "/home/ec2-user/.local/lib/python3.7/site-packages/redis/client.py", line 915, in parse_response
response = connection.read_response()
File "/home/ec2-user/.local/lib/python3.7/site-packages/redis/connection.py", line 739, in read_response
response = self._parser.read_response()
File "/home/ec2-user/.local/lib/python3.7/site-packages/redis/connection.py", line 324, in read_response
raw = self._buffer.readline()
File "/home/ec2-user/.local/lib/python3.7/site-packages/redis/connection.py", line 256, in readline
self._read_from_socket()
File "/home/ec2-user/.local/lib/python3.7/site-packages/redis/connection.py", line 201, in _read_from_socket
raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.
rediscluster.exceptions.ClusterDownError: The cluster is down
2021-11-06 10:38:03,137 [getThread] [INFO ] got mykey value: Nov 06 2021 10:38:03
Please note I have shared both ioredis program and python program which i used for testing in the below case correspondence.
In my python progam :
redis = RedisCluster(startup_nodes=[{"host": "<configuration-endpoint>","port": "6379"}], decode_responses=True,skip_full_coverage_check=True,health_check_interval=2, socket_timeout=3)
I have added -> health_check_interval , socket_timeout
The results were significantly better when I added health checks and a short timeout (2 seconds) to my connection object before which I expirenced higher downtime as the library keeps trying to submit the requests indefinitely to the current connection. I would recommend checking if IO redis has some similar functions as I do not have expirence working with io redis.