
Overall revamp of connection pool, retries and timeouts

Open · achimnol opened this issue on Oct 16 '23 · 6 comments

I'm experiencing many issues (already reported here by other people) related to the (blocking) connection pool in the asyncio version.

I had to roll back and pin my dependency to 4.5.5 for now (lablup/backend.ai#1620).

Since there are already many reports, I will instead offer some high-level design suggestions here:

  • Improvements for timeouts
    • redis-py users should be able to distinguish:
      • connection ready timeout in the blocking connection pool
      • socket connection timeout
      • response timeout
    • These timeouts should be treated differently by the built-in retry subsystem and be distinguishable via subclasses of ConnectionError and/or TimeoutError, to ease writing user-defined retry mechanisms.
      • e.g., within the connection-ready timeout, we should silently wait and retry until we get a connection from a blocking connection pool.
      • The response timeout for non-blocking commands (GET, SET, ...) and blocking commands (BLPOP, XREAD, ...) should be treated as different conditions: active error vs. polling.
        • It would be nice to have explicit examples/docs on how to write a polling loop around blocking commands with proper timeout and retry handling (a rough sketch follows after this list).
        • Please refer: https://github.com/lablup/backend.ai/blob/833ed5477d57846e568b17fec35c82300111a519/src/ai/backend/common/redis_helper.py#L174
      • #2807
      • #2973
      • #2663
    • Maybe we could refer to the design of aiohttp.ClientTimeout (a sketch of such a settings object also follows after this list).
  • BlockingConnectionPool should be the default.
    • ConnectionPool's default max_connections should be a more reasonable number.
      • #2220
    • #2522
    • #3034
    • #3056
    • There are issues to resolve first, though...
      • #2995
      • #2983
      • #2998
        • #2997
        • #2859
          • #2755
            • #2749
      • #2992
        • #445
      • #3124
  • Better connection pool design and abstraction to embrace underlying transport-type differences and connection-related errors
    • #2523
    • #2773
    • #2832
    • #2727
    • #2636
    • #2695
    • #3000 (though this is a threading use case)
      • #2883
    • #3014
    • #3026
    • #3043
  • The sentinel's connection pool should also have the blocking version.
    • Currently there is no BlockingSentinelConnectionPool.
    • #2956
    • We need to clearly define whether the delay after connecting to the sentinel but before connecting to the target master/slave is included in the socket connection timeout or not.
  • The new CLIENT SETINFO mechanism should be generalized.
    • #2682
    • What if a failure occurs while sending this command?
      • In my test cases, which exercise retries across a sentinel master failover or a Redis server restart, this CLIENT SETINFO breaks the retry semantics.
      • Could we enforce the intrinsic/user-defined retry on such event?
    • What if a user wants to insert additional commands such as CLIENT SETNAME or CLIENT NO-EVICT ON?
    • Maybe we could refactor it as a connection-establishment callback whose failure is tolerated (the faulty connection is simply disposed); see the sketch after this list.
      • #2980
        • #2965
  • Backport policy for bug fixes
    • Since I had to roll back to 4.5.5 after experiencing connection leaks during the 4.6.0 → 5.0.1 upgrade, it would be nice to have a backport policy for critical bug fixes.
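
To make the timeout suggestion concrete, a rough sketch of what an aiohttp.ClientTimeout-style settings object could look like (none of these names exist in redis-py today; this is purely illustrative):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class RedisTimeouts:  # hypothetical, modeled after aiohttp.ClientTimeout
        # How long to wait for a free connection from a (blocking) pool.
        pool_acquire: Optional[float] = None
        # How long to wait for the TCP/TLS/unix-socket connect to complete.
        socket_connect: Optional[float] = None
        # How long to wait for the reply to a non-blocking command.
        response: Optional[float] = None

Each of these could then map to its own TimeoutError subclass so that user-defined retry policies can tell the cases apart.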
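
On the polling-loop docs requested above, a rough sketch of the kind of example I have in mind for blocking commands, using redis.asyncio (the stream name, block time, and backoff are illustrative, not redis-py defaults):

    import asyncio

    import redis.asyncio as aioredis
    from redis.exceptions import ConnectionError, TimeoutError

    async def poll_stream(r: aioredis.Redis, stream: str) -> None:
        last_id = "$"
        while True:
            try:
                # Block server-side for up to 5 seconds; an empty reply just
                # means "no new data yet" and is part of normal polling.
                reply = await r.xread({stream: last_id}, block=5000, count=16)
            except (ConnectionError, TimeoutError):
                # Transport-level failure: back off briefly and retry; the
                # pool hands out a fresh connection on the next attempt.
                await asyncio.sleep(0.5)
                continue
            if not reply:
                continue
            for _stream_key, messages in reply:
                for msg_id, fields in messages:
                    last_id = msg_id
                    print(msg_id, fields)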
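
And for the connection-establishment callback idea, a sketch of the kind of hook I mean (this is not an existing redis-py extension point; the subclass below simply overrides on_connect, and CLIENT NO-EVICT stands in for any optional post-connect setup):

    import redis.asyncio as aioredis
    from redis.asyncio.connection import Connection
    from redis.exceptions import ResponseError

    class LenientHandshakeConnection(Connection):
        async def on_connect(self):
            # Run the normal handshake (AUTH, SELECT, CLIENT SETINFO, ...).
            await super().on_connect()
            try:
                # Optional, best-effort setup; a rejection here should not
                # poison later retries on this connection.
                await self.send_command("CLIENT", "NO-EVICT", "ON")
                await self.read_response()
            except ResponseError:
                pass

    pool = aioredis.ConnectionPool(connection_class=LenientHandshakeConnection)
    r = aioredis.Redis(connection_pool=pool)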

achimnol · Oct 16 '23

I'm not sure how to fix these problems without breaking existing users. Many of these problems are intertwined, and having a better design and abstraction of connections and connection pools seems to be the way to go.

I understand that the list may look intimidating to the maintainers. I'm not urging you to fix everything ASAP, but hoping that this list can serve as a guide for tackling the issues with a more holistic view.

achimnol · Oct 16 '23

For a new user it is really quite hard to understand what's going on with the ConnectionPool, Redis, and Connection objects. E.g., I am trying to figure out whether the retry functionality works with Azure AD authentication when my password (token) expires while I am holding a connection from the pool.
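
For what it's worth, a minimal sketch of one way this might be wired up, assuming a hypothetical fetch_azure_ad_token() helper (e.g. built on azure-identity); redis-py consults the credential provider whenever a new connection authenticates, so reconnects should at least pick up a fresh token, though I'm not sure it covers the case where the token expires on an already-established connection:

    from redis import Redis
    from redis.credentials import CredentialProvider

    class AzureADCredentialProvider(CredentialProvider):
        def get_credentials(self):
            # fetch_azure_ad_token() is a hypothetical helper that returns a
            # currently valid AD token.
            token = fetch_azure_ad_token()
            return ("my-user", token)

    r = Redis(host="localhost", port=6379,
              credential_provider=AzureADCredentialProvider())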

Another couple of suggestions.

  • use with and async with to make sure that connections taken from the pool are returned, e.g.:
async with pool.take() as redis:
    await redis.set("foo", "bar")
  • directly hand the user of the pool a working client (Redis) object
  • discard and recreate connections instead of trying to fix (reconnect) them. Follow RAII (resource acquisition is initialization); see the sketch after this list.
    • A connection is connected when created, and only then added to the pool.
    • A failure indicates the connection is broken; it will be cleaned up and removed from the pool.
    • The pool will (already) automatically create new ones if required.
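
A rough sketch of how that take() ergonomics could be approximated on top of today's API (take() does not exist in redis-py; the wrapper below is purely illustrative):

    import contextlib

    import redis.asyncio as aioredis

    class Pool:
        """Illustrative wrapper; not an existing redis-py class."""

        def __init__(self, **kwargs):
            self._pool = aioredis.BlockingConnectionPool(**kwargs)

        @contextlib.asynccontextmanager
        async def take(self):
            # Hand the caller a working client bound to the shared pool.
            client = aioredis.Redis(connection_pool=self._pool)
            try:
                yield client
            finally:
                # Release the client's resources; the shared pool stays open
                # because it was passed in explicitly.
                await client.close()

    async def main():
        pool = Pool(host="localhost", port=6379)
        async with pool.take() as r:
            await r.set("foo", "bar")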

Just some thoughts...

martinitus · Dec 20 '23

Looking forward to what happens in #3200! :eyes:

achimnol · Apr 09 '24

> For a new user it is really quite hard to understand what's going on with the ConnectionPool, Redis, and Connection objects. E.g., I am trying to figure out whether the retry functionality works with Azure AD authentication when my password (token) expires while I am holding a connection from the pool.

Do you have any clues or guidance on this? I am looking into the same authentication scenario, though in my case it's not just Redis, but also Celery, django-redis, python redis lock...

SorianoMarmol · Apr 26 '24

With Sentinel it becomes a whole other question, one I cannot seem to find covered in the docs or examples, and even reading the source is a bit confusing. Fundamentally, what I want is a reliable object (or two, if doing read-only access) that I can ask to 'give me an active connection', but getting to that state is pretty hard...

Currently I create a sentinel, which in itself is a bit horrific:

      sentinel = aioredis.sentinel.Sentinel(  # type: ignore
          settings.redis_sentinel_hosts,
          **_standard_connection_kwargs(settings, **kwargs),
          sentinel_kwargs={
              **_standard_connection_kwargs(settings, **kwargs),
              "password": settings.redis_sentinel_password or settings.redis_password,
          },
      )

I can then use slave_for / master_for, but these only seem to be single connections rather than returning a pool that I can use elsewhere. The test suite has only one test for SentinelConnectionPool, which seems to just check that it tries a couple of hosts and then blows up. So at present I just use master_for / slave_for, but from what I can see with tcpdump on a live server this creates a new TCP connection each time rather than pooling it (whereas the sentinel connection itself is persistent).
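
A minimal sketch of keeping a single Sentinel-backed client around and reusing it (host names, ports, and the service name below are placeholders); as far as I can tell, master_for()/slave_for() return Redis clients backed by a sentinel-aware connection pool, so they are meant to be created once and reused:

    import asyncio

    from redis.asyncio.sentinel import Sentinel

    sentinel = Sentinel(
        [("sentinel-1", 26379), ("sentinel-2", 26379)],
        socket_timeout=1.0,
    )
    # Create the clients once and reuse them across the application.
    master = sentinel.master_for("mymaster", socket_timeout=1.0)
    replica = sentinel.slave_for("mymaster", socket_timeout=1.0)

    async def main():
        await master.set("foo", "bar")
        print(await replica.get("foo"))

    asyncio.run(main())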

mzealey · Nov 12 '24

Is it possible to get an update on where / if the above issues sit on the roadmap? Is e.g. #3200 still intended to be developed?

mattjudge · Dec 16 '24