redis-py async redis cluster should use initial startup nodes during reinitialization in case of failover

Version: redis-py: 4.3.3 redis: 7.0.2

Platform: python 3.7

Description: Async implementation of RedisCluster overwrites initial startup nodes during first initialization. This is caused by this line.

When deployed in dynamic systems like Kubernetes your shards may sometimes change their IP addresses. This likely leads to a situation when python client ends up with a pool of invalid (outdated) startup nodes and can not perform reinitialization, since NodeManager uses raw IP addresses.

For example, let's say we have cluster with 3 masters in k8s and use python client this way:

from redis.asyncio import RedisCluster

client = RedisCluster.from_url('redis://redis-cluster-headless:6379/0')
print(client.nodes_manager.startup_nodes)

# outputs {'redis-cluster-headless:6379': [host=redis-cluster-headless, port=6379, name=redis-cluster-headless:6379, server_type=None]}

await client # initialization
print(client.nodes_manager.startup_nodes)

# outputs something like this
# {'<ip1>:6379': [host=<ip1>, port=6379, name=<ip1>:6379, server_type=primary],
#  '<ip2>:6379': [host=<ip2>, port=6379, name=<ip2>:6379, server_type=primary],
#  '<ip3>:6379': [host=<ip3>, port=6379, name=<ip3>:6379, server_type=primary]}

now image your python client is idle for some time without requests, and during that period K8S have recreated each redis container due to whatever reason (crash, or moved to another node by autoscaler for example).

Then, next time python client will try to perform a redis call it will find out that all ClusterNodes are unavaliable, will try to reinitialize NodeManager and end up with no connectable node to get new cluster configuration from. This would not be an issue if initial headless URL was still in the pool of startup nodes.

Sync implementation of RedisCluster preserves initial node address just as suggested above, so i believe async implementation should behave the same.

For now I use the following patch, but I can't call it an elegant way to get around the problem. It must be solved through changes to NodeManager.

class AsyncRedisCluster(RedisCluster):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._initial_startup_nodes = dict(self.nodes_manager.startup_nodes)

    async def initialize(self):
        rv = await super().initialize()
        self.nodes_manager.startup_nodes.update(self._initial_startup_nodes)
        return rv

P.S. sync implementation has the opposite problem, NodesManager doesn't pop unused addresses from startup_nodes storage => it may grow indefinitely as soon as your containers change IPs.

Nov 26 '22 19:11 artesby

I'll be happy to review any PRs with a fix.

Nov 29 '22 11:11 utkarshgupta137

Hi! Is there any update on this? I got the same issue when I use AWS MemoryDB. got below exception(the same one mentioned above) when all the node was down and recovered.

Redis Cluster cannot be connected. Please provide at least one reachable node: None

I thought doing some reinitialize would be working, but it was not.

Feb 12 '23 02:02 jiyeonseo

Hi! I got the same issue too when I use AWS redis. Exactly the same exception as Jiyeonseo posted:

"redis_error": "Redis errors out with exception: Redis Cluster cannot be connected. Please provide at least one reachable node: None"

Is there any updates on this issue? Thanks!

Feb 20 '23 14:02 shuyingz

Heads up that the "this line" link in the original code description pointed at main and has thus become stale.

Given that state of the repository at the time the issue was created, the line in question is https://github.com/redis/redis-py/blob/d3a3ada03e080f39144807c9fbe44876c40e0548/redis/asyncio/cluster.py#L1337

Oct 29 '23 22:10 thundergolfer

Seems like adding dynamic_startup_nodes=False in cluster constructor solves the problem, at least for sync client.

May 21 '24 11:05 andr-c