Overall revamp of connection pool, retries and timeouts
I'm experiencing many issues (already reported here by other people) related to the (blocking) connection pool in the asyncio version.
I had to roll back and pin my dependency to version 4.5.5 for now (lablup/backend.ai#1620).
Since there are already many individual reports, I would instead like to offer some high-level design suggestions here:
- redis-py users should be able to distinguish the following cases (a rough sketch of such an exception hierarchy follows after this list):
- The connection is closed by myself.
- The connection is actively closed by the server after sending responses.
- The connection is abruptly closed by the server without sending responses.
- The client is evicted due to a server-side limit.
- Improvements for timeouts
- Distinguish:
- connection ready timeout in the blocking connection pool
- socket connection timeout
- response timeout
- These timeouts should be treated differently in the intrinsic retry subsystem and be distinguishable via subclasses of `ConnectionError` and/or `TimeoutError`, to ease writing user-defined retry mechanisms.
- e.g., within the connection ready timeout, we should silently wait and retry until we get a connection from a blocking connection pool.
- The response timeout for non-blocking commands (`GET`, `SET`, ...) and blocking commands (`BLPOP`, `XREAD`, ...) should be treated as different conditions: an active error vs. normal polling.
- It would be nice to have explicit examples/docs on how to write a polling loop around blocking commands with proper timeout and retry handling (a rough sketch follows after this list).
- Please refer to: https://github.com/lablup/backend.ai/blob/833ed5477d57846e568b17fec35c82300111a519/src/ai/backend/common/redis_helper.py#L174
- #2807
- #2973
- #2663
- Maybe we could take inspiration from the design of `aiohttp.ClientTimeout` (see the timeout-config sketch after this list).
- `BlockingConnectionPool` should be the default.
- `ConnectionPool`'s default `max_connections` should be a more reasonable number.
- #2220
- #2522
- #3034
- #3056
- There are issues to resolve first, though...
- #2995
- #2983
- #2998
- #2997
- #2859
- #2755
- #2749
- #2992
- #445
- #3124
- Better connection pool design and abstraction to embrace underlying transport-type differences and connection errors
- #2523
- #2773
- #2832
- #2727
- #2636
- #2695
- #3000 (though this is a thread use case)
- #2883
- #3014
- #3026
- #3043
- The sentinel's connection pool should also have the blocking version.
- Currently there is no `BlockingSentinelConnectionPool`.
- #2956
- We need to clearly define whether the delay after connecting to the sentinel but before connecting to the target master/slave is included in the socket connection timeout or not.
- The new `CLIENT SETINFO` mechanism should be generalized.
- #2682
- What if a failure occurs during sending this command?
- In my test cases that exercise retries across a sentinel master failover or a redis server restart, this `CLIENT SETINFO` breaks the retry semantics.
- Could we enforce the intrinsic/user-defined retry on such an event?
- What if a user wants to insert additional commands like `CLIENT SETNAME` or `CLIENT NO-EVICT on`?
- Maybe we could refactor it as a connection-establishment callback, whose failure is ignored while the faulty connection is disposed of (see the callback sketch after this list).
- #2980
- #2965
- Backport policy for bug fixes
- Since I had to roll back to 4.5.5 after experiencing connection leaks with the 4.6.0 → 5.0.1 upgrade, it would be nice to have a backport policy for critical bug fixes.
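
To make the first point above concrete, here is a rough sketch of what a distinguishable exception hierarchy could look like; none of these subclasses exist in redis-py today, and the names are only illustrative.

```python
# Hypothetical sketch only: these subclasses do not exist in redis-py today.
# The idea is that each "connection closed" scenario raises a distinct subclass
# so user code can react differently (reconnect, back off, re-authenticate, ...).
from redis.exceptions import ConnectionError


class ConnectionClosedByClient(ConnectionError):
    """The application itself closed the connection (explicit close/disconnect)."""


class ConnectionClosedByServer(ConnectionError):
    """The server closed the connection cleanly after sending its responses."""


class ConnectionResetByServer(ConnectionError):
    """The connection dropped without a response (e.g. TCP reset, server crash)."""


class ClientEvictedByServer(ConnectionClosedByServer):
    """The client was evicted due to a server-side limit (maxclients, CLIENT KILL, ...)."""
```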
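Similarly, here is a rough sketch of a grouped timeout configuration in the spirit of `aiohttp.ClientTimeout`; the `RedisTimeout` name and its fields are hypothetical and not part of any existing redis-py API.

```python
# Hypothetical sketch of a grouped timeout configuration, loosely modeled on
# aiohttp.ClientTimeout; none of these names exist in redis-py today.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RedisTimeout:
    # Max time to wait for a free connection from a BlockingConnectionPool.
    pool_acquire: Optional[float] = None
    # Max time to establish the TCP/TLS connection (socket connect timeout).
    connect: Optional[float] = None
    # Max time to wait for the response of a non-blocking command.
    response: Optional[float] = None
    # Separate budget for blocking commands (BLPOP, XREAD block=..., ...), so that
    # "still polling" is not conflated with "the server is unresponsive".
    blocking_response: Optional[float] = None


# Hypothetical usage:
#   r = Redis(..., timeout=RedisTimeout(pool_acquire=10, connect=2, response=5))
```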
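And here is the kind of polling-loop example around a blocking command that I would like to see documented, written against the current `redis.asyncio` API. Which exceptions should be caught and which should abort is exactly the ambiguity described above, so treat the `except` clauses as assumptions rather than a recommendation.

```python
import asyncio

from redis.asyncio import Redis
from redis.exceptions import ConnectionError, TimeoutError


async def poll_stream(r: Redis, stream: str) -> None:
    """Consume a stream via blocking XREAD, retrying on transient failures."""
    last_id = "$"  # only read entries that arrive after we start
    while True:
        try:
            # Block for up to 5 seconds waiting for new entries.
            entries = await r.xread({stream: last_id}, block=5000, count=10)
        except TimeoutError:
            # A response timeout on a blocking command: keep polling.
            continue
        except ConnectionError:
            # A connection-level failure: back off briefly, then retry and let
            # the pool hand us a (re)connected connection on the next call.
            await asyncio.sleep(1.0)
            continue
        if not entries:
            continue  # the server-side block expired without new data
        for _stream_name, messages in entries:
            for message_id, fields in messages:
                last_id = message_id
                print(message_id, fields)
```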
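Finally, a rough sketch of the connection-establishment callback idea; the hook itself and how the pool would wire it in are purely hypothetical.

```python
# Hypothetical sketch: a user-provided post-connect hook that the pool would
# invoke right after a connection is established. No such hook exists in
# redis-py today, so the wiring shown in the trailing comment is an assumption.
from typing import Awaitable, Callable

from redis.asyncio.connection import Connection

PostConnectHook = Callable[[Connection], Awaitable[None]]


async def my_post_connect(conn: Connection) -> None:
    # Extra per-connection setup a user might want to run, e.g.:
    await conn.send_command("CLIENT", "SETNAME", "backend.ai-worker")
    await conn.read_response()
    await conn.send_command("CLIENT", "NO-EVICT", "on")
    await conn.read_response()


# Inside the (hypothetical) pool code, after a successful connect:
#
#     try:
#         await self._post_connect_hook(connection)
#     except Exception:
#         # Dispose of the faulty connection and let the normal retry machinery
#         # create a fresh one, instead of surfacing this error to the command
#         # that triggered the connect.
#         await connection.disconnect()
#         raise
```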
I'm not sure how to fix these problems without breaking existing users. It seems that many of the problems are intertwined, and a better design and abstraction of connections and connection pools would be the way to go.
I understand that the list may look intimidating to the maintainers. I'm not urging you to fix everything ASAP, but hoping that this list can serve as a guide for approaching the issues with a more holistic view.
For a new user it is really quite hard to understand what's going on with the `ConnectionPool`, `Redis`, and `Connection` objects. E.g., I am trying to figure out whether the retry functionality works with Azure AD authentication in the case that my password (token) expires while I am holding a connection from the pool.
Another couple of suggestions.
- use `with` and `async with` to make sure connections taken from the pool are returned, e.g.:

  ```python
  async with pool.take() as redis:
      await redis.set("foo", "bar")
  ```
- directly hand the user of the pool a working client (`Redis`) object
- discard and recreate connections instead of trying to fix (reconnect) them; follow RAII (resource acquisition is initialization). A rough sketch of the take/discard idea follows below.
  - A connection is connected when created, and only then added to the pool.
  - A failure indicates the connection is broken; it will be cleaned up and removed from the pool.
  - The pool will (already) automatically create new ones if required.
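
To make the `take()` idea a bit more concrete, here is a rough sketch built on top of the existing async pool API. `take()` itself does not exist in redis-py, it hands back a raw `Connection` rather than a `Redis` client, and "discarding" is approximated by disconnecting before release because the current pool has no explicit discard operation.

```python
import contextlib
from typing import AsyncIterator

from redis.asyncio import BlockingConnectionPool
from redis.asyncio.connection import Connection


@contextlib.asynccontextmanager
async def take(pool: BlockingConnectionPool) -> AsyncIterator[Connection]:
    # Waits until the blocking pool hands out a connection.
    conn = await pool.get_connection("_")
    try:
        yield conn
    except BaseException:
        # RAII-style: a failure means the connection may be broken, so drop it
        # instead of trying to repair it; the pool will reconnect or recreate
        # one on the next checkout.
        await conn.disconnect()
        raise
    finally:
        await pool.release(conn)
```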
Just some thoughts...
Looking forward to what happens in #3200! :eyes:
> For a new user it is really quite hard to understand what's going on with the `ConnectionPool`, `Redis`, and `Connection` objects. E.g., I am trying to figure out whether the retry functionality works with Azure AD authentication in the case that my password (token) expires while I am holding a connection from the pool.
Do you have any clues or guidance on this? I am analyzing the same authentication, though in my case it's not just Redis, but also Celery, django-redis, python redis lock...
With Sentinel it becomes a whole other question as well, one which I cannot find covered in the docs or examples, and even reading the source is a bit confusing. Fundamentally what I want is a reliable object (or two, if doing read-only) that I can ask to 'give me an active connection', but getting to that state is pretty hard...
Currently I create a sentinel, which in itself is a bit horrific:
```python
sentinel = aioredis.sentinel.Sentinel(  # type: ignore
    settings.redis_sentinel_hosts,
    **_standard_connection_kwargs(settings, **kwargs),
    sentinel_kwargs={
        **_standard_connection_kwargs(settings, **kwargs),
        "password": settings.redis_sentinel_password or settings.redis_password,
    },
)
```
I can then use `slave_for` / `master_for`, but these only seem to be single connections rather than returning a pool that I can use elsewhere. The test suite has only one test for `SentinelConnectionPool`, which seems to just test that it tries a couple of hosts and then blows up. So at present I just use `master_for` / `slave_for`, but from what I can see from tcpdump on a live server this creates a new TCP connection each time rather than pooling it (whereas the sentinel connection itself is persistent).
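What I am experimenting with now is creating the clients once and reusing them, on the assumption that whatever `master_for` / `slave_for` return wraps its own `SentinelConnectionPool` and is meant to be created once and shared rather than re-created per call; I have not fully verified that against the source.

```python
# Assumes the client returned by master_for()/slave_for() wraps its own
# SentinelConnectionPool and is meant to be created once and shared; "mymaster"
# stands in for the actual service name configured in sentinel.conf.
master = sentinel.master_for("mymaster", socket_timeout=0.5)
replica = sentinel.slave_for("mymaster", socket_timeout=0.5)


async def example() -> None:
    # Reuse the two clients everywhere instead of calling master_for()/
    # slave_for() per request; each such call appears to build a fresh
    # client (and pool).
    await master.set("foo", "bar")
    print(await replica.get("foo"))
```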
Is it possible to get an update on where / if the above issues sit on the roadmap? Is e.g. #3200 still intended to be developed?