asyncio redis cluster: fixed reconnection when whole cluster goes down

Open adaamz opened this issue 2 years ago • 1 comments

Pull Request check-list

  • [x] Do tests and lints pass with this change?
  • [x] Do the CI tests pass with this change (enable it first in your forked repo and wait for the github action build to finish)?
  • [X] Is the new or changed code fully tested?
  • [ ] Is a documentation update included (if this change modifies existing APIs, or introduces new ones)?
  • [ ] Is there an example added to the examples folder (if applicable)?
  • [ ] Was the change added to CHANGES file?

Description of change

When a Redis cluster goes down completely (all nodes offline), the redis-py library is unable to reconnect, even once at least the startup nodes are back online. This workaround worked for me, but I'm not sure whether there is a more efficient way to achieve the same fix.

I tested it in isolation by spawning a minimal Redis cluster in Docker, then taking it down and bringing it back up after a few seconds to see how the app reacts. My Python test script looks like this:

import asyncio
import time
import traceback

from redis.asyncio.cluster import ClusterNode, RedisCluster

redis_host = "172.26.0.2"
redis_port = 6379
key = "some_key"


def prepare_redis_async():
    return RedisCluster(
        ssl=False,
        startup_nodes=[
            ClusterNode(host=redis_host, port=redis_port)
        ],
        socket_connect_timeout=3,
        socket_timeout=60,
        require_full_coverage=True,
        max_connections=5,
        cluster_error_retry_attempts=20
    )


redis_client_async = prepare_redis_async()


async def async_get_value():
    x = await redis_client_async.get(key)
    print(x)
    await redis_client_async.set(key, "async_test")

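if __name__ == "__main__":
    # Create one event loop explicitly and reuse it for every attempt.
    loop = asyncio.new_event_loop()

    while True:
        try:
            loop.run_until_complete(async_get_value())
        except Exception:
            # Catch only Exception so that Ctrl+C can still stop the script.
            print(traceback.format_exc())

        time.sleep(0.1)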

adaamz avatar Jan 14 '24 20:01 adaamz

@chayim Hello, is there anything I can do to get this PR reviewed? Thanks

adaamz avatar Feb 26 '24 11:02 adaamz

Hi @adaamz, thank you for the time and effort you put into this PR! I'm closing it as the issue has already been addressed in PR #3646.

petyaslavova avatar May 29 '25 09:05 petyaslavova

@petyaslavova but this doesn't solve my issue, or does it?

When all nodes go down, they are all removed from the list, and when they come back up the list is still empty, so we are unable to use those nodes again.

adaamz avatar May 29 '25 10:05 adaamz

@adaamz, you should instantiate your cluster with dynamic_startup_nodes=False. This ensures that the initial list of startup_nodes won't be overwritten by nodes discovered from the cluster, allowing the original addresses to remain available after a cluster recovery.
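
A minimal sketch of that suggestion, applied to the constructor from the test script above. It assumes a redis-py release in which the asyncio RedisCluster accepts dynamic_startup_nodes (the change referenced as PR #3646); the CLUSTER_OPTIONS name and the function's host and port parameters are illustrative, not part of the library.

```python
# Sketch only: assumes the asyncio RedisCluster in the installed redis-py
# accepts dynamic_startup_nodes. All other options come from the test script.
CLUSTER_OPTIONS = {
    "socket_connect_timeout": 3,
    "socket_timeout": 60,
    "require_full_coverage": True,
    "max_connections": 5,
    "cluster_error_retry_attempts": 20,
    # Keep the configured startup nodes instead of replacing them with
    # the nodes discovered from the cluster.
    "dynamic_startup_nodes": False,
}


def prepare_redis_async(host="172.26.0.2", port=6379):
    # Imported here so the options above can be inspected without
    # redis-py being installed.
    from redis.asyncio.cluster import ClusterNode, RedisCluster

    return RedisCluster(
        startup_nodes=[ClusterNode(host=host, port=port)],
        **CLUSTER_OPTIONS,
    )
```

With this setting the original address stays in the startup list, so the retry loop in the test script can reach it again once the cluster is back.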

petyaslavova avatar May 29 '25 10:05 petyaslavova

@petyaslavova Thanks. What about the case where we configure a single DNS name as the startup node and expect the library to discover the rest of the cluster? Will it discover the whole cluster when connecting to only the configured node, or will it communicate with just that one node?

adaamz avatar May 29 '25 19:05 adaamz

> Will it discover the whole cluster when connecting to only the configured node, or will it communicate with just that one node?

@adaamz Yes, it will discover the whole cluster. The discovered nodes will be used for almost all communication. The only request sent to the initial startup node(s) is the one that retrieves the cluster topology.

petyaslavova avatar May 30 '25 04:05 petyaslavova