Overall revamp of connection pool, retries and timeouts
I'm experiencing many issues (already reported here by other people) related to the (blocking) connection pool in the asyncio version.
I had to roll back and pin my dependency to version 4.5.5 for now (lablup/backend.ai#1620).
Since there are already many individual reports, I would instead like to offer some high-level design suggestions here:
- redis-py users should be able to distinguish the following cases (a rough sketch of such an exception hierarchy follows after this list):
- The connection is closed by myself.
- The connection is actively closed by the server after sending responses.
- The connection is abruptly closed by the server without sending responses.
- The client is evicted due to a server-side limit.
- Improvements for timeouts
- Distinguish:
- connection ready timeout in the blocking connection pool
- socket connection timeout
- response timeout
- These timeouts should be treated differently in the intrinsic retry subsystem and be distinguishable via subclasses of `ConnectionError` and/or `TimeoutError`, to ease writing user-defined retry mechanisms.
- e.g., within the connection ready timeout, we should silently wait and retry until we get a connection from a blocking connection pool.
- The response timeout for non-blocking commands (`GET`, `SET`, ...) and blocking commands (`BLPOP`, `XREAD`, ...) should be treated as different conditions: an active error vs. normal polling.
- It would be nice to have explicit examples/docs on how to write a polling loop around blocking commands with proper timeout and retry handling (a rough sketch follows after this list).
- Please refer to: https://github.com/lablup/backend.ai/blob/833ed5477d57846e568b17fec35c82300111a519/src/ai/backend/common/redis_helper.py#L174
- #2807
- #2973
- #2663
- Maybe we could take inspiration from the design of `aiohttp.ClientTimeout` (see the timeout-config sketch after this list).
- `BlockingConnectionPool` should be the default.
- `ConnectionPool`'s default `max_connections` should be a more reasonable number.
- #2220
- #2522
- #3034
- #3056
- There are issues to resolve first, though...
- #2995
- #2983
- #2998
- #2997
- #2859
- #2755
- #2749
- #2992
- #445
- #3124
- Better connection pool design and abstraction to embrace underlying transport-type differences and connection errors
- #2523
- #2773
- #2832
- #2727
- #2636
- #2695
- #3000 (though this is a thread use case)
- #2883
- #3014
- #3026
- #3043
- The sentinel's connection pool should also have the blocking version.
- Currently there is no `BlockingSentinelConnectionPool`.
- #2956
- We need to clearly define whether the delay after connecting to the sentinel but before connecting to the target master/slave is included in the socket connection timeout or not.
- The new `CLIENT SETINFO` mechanism should be generalized.
- #2682
- What if a failure occurs during sending this command?
- In my test cases that exercise retries across a sentinel master failover or a redis server restart, this `CLIENT SETINFO` breaks the retry semantics.
- Could we enforce the intrinsic/user-defined retry on such an event?
- What if a user wants to insert additional commands like `CLIENT SETNAME` or `CLIENT NO-EVICT on`?
- Maybe we could refactor it as a connection-establishment callback, whose failure is ignored while the faulty connection is disposed of (see the callback sketch after this list).
- #2980
- #2965
- Backport policy for bug fixes
- Since I had to roll back to 4.5.5 after experiencing connection leaks with the 4.6.0 → 5.0.1 upgrade, it would be nice to have a backport policy for critical bug fixes.
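
To make the first point above concrete, here is a rough sketch of what a distinguishable exception hierarchy could look like; none of these subclasses exist in redis-py today, and the names are only illustrative.

```python
# Hypothetical sketch only: these subclasses do not exist in redis-py today.
# The idea is that each "connection closed" scenario raises a distinct subclass
# so user code can react differently (reconnect, back off, re-authenticate, ...).
from redis.exceptions import ConnectionError


class ConnectionClosedByClient(ConnectionError):
    """The application itself closed the connection (explicit close/disconnect)."""


class ConnectionClosedByServer(ConnectionError):
    """The server closed the connection cleanly after sending its responses."""


class ConnectionResetByServer(ConnectionError):
    """The connection dropped without a response (e.g. TCP reset, server crash)."""


class ClientEvictedByServer(ConnectionClosedByServer):
    """The client was evicted due to a server-side limit (maxclients, CLIENT KILL, ...)."""
```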
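Similarly, here is a rough sketch of a grouped timeout configuration in the spirit of `aiohttp.ClientTimeout`; the `RedisTimeout` name and its fields are hypothetical and not part of any existing redis-py API.

```python
# Hypothetical sketch of a grouped timeout configuration, loosely modeled on
# aiohttp.ClientTimeout; none of these names exist in redis-py today.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RedisTimeout:
    # Max time to wait for a free connection from a BlockingConnectionPool.
    pool_acquire: Optional[float] = None
    # Max time to establish the TCP/TLS connection (socket connect timeout).
    connect: Optional[float] = None
    # Max time to wait for the response of a non-blocking command.
    response: Optional[float] = None
    # Separate budget for blocking commands (BLPOP, XREAD block=..., ...), so that
    # "still polling" is not conflated with "the server is unresponsive".
    blocking_response: Optional[float] = None


# Hypothetical usage:
#   r = Redis(..., timeout=RedisTimeout(pool_acquire=10, connect=2, response=5))
```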
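And here is the kind of polling-loop example around a blocking command that I would like to see documented, written against the current `redis.asyncio` API. Which exceptions should be caught and which should abort is exactly the ambiguity described above, so treat the `except` clauses as assumptions rather than a recommendation.

```python
import asyncio

from redis.asyncio import Redis
from redis.exceptions import ConnectionError, TimeoutError


async def poll_stream(r: Redis, stream: str) -> None:
    """Consume a stream via blocking XREAD, retrying on transient failures."""
    last_id = "$"  # only read entries that arrive after we start
    while True:
        try:
            # Block for up to 5 seconds waiting for new entries.
            entries = await r.xread({stream: last_id}, block=5000, count=10)
        except TimeoutError:
            # A response timeout on a blocking command: keep polling.
            continue
        except ConnectionError:
            # A connection-level failure: back off briefly, then retry and let
            # the pool hand us a (re)connected connection on the next call.
            await asyncio.sleep(1.0)
            continue
        if not entries:
            continue  # the server-side block expired without new data
        for _stream_name, messages in entries:
            for message_id, fields in messages:
                last_id = message_id
                print(message_id, fields)
```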
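Finally, a rough sketch of the connection-establishment callback idea; the hook itself and how the pool would wire it in are purely hypothetical.

```python
# Hypothetical sketch: a user-provided post-connect hook that the pool would
# invoke right after a connection is established. No such hook exists in
# redis-py today, so the wiring shown in the trailing comment is an assumption.
from typing import Awaitable, Callable

from redis.asyncio.connection import Connection

PostConnectHook = Callable[[Connection], Awaitable[None]]


async def my_post_connect(conn: Connection) -> None:
    # Extra per-connection setup a user might want to run, e.g.:
    await conn.send_command("CLIENT", "SETNAME", "backend.ai-worker")
    await conn.read_response()
    await conn.send_command("CLIENT", "NO-EVICT", "on")
    await conn.read_response()


# Inside the (hypothetical) pool code, after a successful connect:
#
#     try:
#         await self._post_connect_hook(connection)
#     except Exception:
#         # Dispose of the faulty connection and let the normal retry machinery
#         # create a fresh one, instead of surfacing this error to the command
#         # that triggered the connect.
#         await connection.disconnect()
#         raise
```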
I'm not sure how to fix these problems without breaking existing users. It seems that many of the problems are intertwined, and a better design and abstraction of connections and connection pools would be the way to go.
I understand that the list may look intimidating to the maintainers. I'm not urging you to fix everything ASAP, but hoping that this list can serve as a guide for approaching the issues with a more holistic view.
For a new user it is really quite hard to understand what's going on with the `ConnectionPool`, `Redis`, and `Connection` objects. E.g., I am trying to figure out whether the retry functionality works with Azure AD authentication in the case that my password (token) expires while I am holding a connection from the pool.
Another couple of suggestions.
- use `with` and `async with` to make sure connections taken from the pool are returned, e.g.:

  ```python
  async with pool.take() as redis:
      await redis.set("foo", "bar")
  ```
- directly hand the user of the pool a working client (`Redis`) object
- discard and recreate connections instead of trying to fix (reconnect) them; follow RAII (resource acquisition is initialization). A rough sketch of the take/discard idea follows below.
  - A connection is connected when created, and only then added to the pool.
  - A failure indicates the connection is broken; it will be cleaned up and removed from the pool.
  - The pool will (already) automatically create new ones if required.
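
To make the `take()` idea a bit more concrete, here is a rough sketch built on top of the existing async pool API. `take()` itself does not exist in redis-py, it hands back a raw `Connection` rather than a `Redis` client, and "discarding" is approximated by disconnecting before release because the current pool has no explicit discard operation.

```python
import contextlib
from typing import AsyncIterator

from redis.asyncio import BlockingConnectionPool
from redis.asyncio.connection import Connection


@contextlib.asynccontextmanager
async def take(pool: BlockingConnectionPool) -> AsyncIterator[Connection]:
    # Waits until the blocking pool hands out a connection.
    conn = await pool.get_connection("_")
    try:
        yield conn
    except BaseException:
        # RAII-style: a failure means the connection may be broken, so drop it
        # instead of trying to repair it; the pool will reconnect or recreate
        # one on the next checkout.
        await conn.disconnect()
        raise
    finally:
        await pool.release(conn)
```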
Just some thoughts...
Looking forward to what happens in #3200! :eyes:
> For a new user it is really quite hard to understand what's going on with the `ConnectionPool`, `Redis`, and `Connection` objects. E.g., I am trying to figure out whether the retry functionality works with Azure AD authentication in the case that my password (token) expires while I am holding a connection from the pool.
Do you have any clues or guidance on this? I am analyzing the same authentication, though in my case it's not just Redis, but also Celery, django-redis, python redis lock...
With Sentinel it becomes a whole other question as well, one which I cannot find covered in the docs or examples, and even reading the source is a bit confusing. Fundamentally what I want is a reliable object (or two, if doing read-only) that I can ask to 'give me an active connection', but getting to that state is pretty hard...
Currently I create a sentinel, which in itself is a bit horrific:
```python
sentinel = aioredis.sentinel.Sentinel(  # type: ignore
    settings.redis_sentinel_hosts,
    **_standard_connection_kwargs(settings, **kwargs),
    sentinel_kwargs={
        **_standard_connection_kwargs(settings, **kwargs),
        "password": settings.redis_sentinel_password or settings.redis_password,
    },
)
```
I can then use `slave_for` / `master_for`, but these only seem to be single connections rather than returning a pool that I can use elsewhere. The test suite has only one test for `SentinelConnectionPool`, which seems to just test that it tries a couple of hosts and then blows up. So at present I just use `master_for` / `slave_for`, but from what I can see from tcpdump on a live server this creates a new TCP connection each time rather than pooling it (whereas the sentinel connection itself is persistent).
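What I am experimenting with now is creating the clients once and reusing them, on the assumption that whatever `master_for` / `slave_for` return wraps its own `SentinelConnectionPool` and is meant to be created once and shared rather than re-created per call; I have not fully verified that against the source.

```python
# Assumes the client returned by master_for()/slave_for() wraps its own
# SentinelConnectionPool and is meant to be created once and shared; "mymaster"
# stands in for the actual service name configured in sentinel.conf.
master = sentinel.master_for("mymaster", socket_timeout=0.5)
replica = sentinel.slave_for("mymaster", socket_timeout=0.5)


async def example() -> None:
    # Reuse the two clients everywhere instead of calling master_for()/
    # slave_for() per request; each such call appears to build a fresh
    # client (and pool).
    await master.set("foo", "bar")
    print(await replica.get("foo"))
```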
Is it possible to get an update on where / if the above issues sit on the roadmap? Is e.g. #3200 still intended to be developed?