asyncpg.exceptions.ConnectionDoesNotExistError: connection was closed in the middle of operation
Hi, sometimes I get the above exception. Once it happens, all subsequent DB queries raise this exception, so I have to restart the entire web app. The traceback looks something like this:
File "/app/app.py", line 359, in process_list
async with DB.transaction():
File "/piccolo/piccolo/engine/postgres.py", line 173, in __aenter__
self.connection = await self.engine.get_new_connection()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/piccolo/piccolo/engine/postgres.py", line 389, in get_new_connection
return await asyncpg.connect(**self.config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib64/python3.11/site-packages/asyncpg/connection.py", line 2092, in connect
return await connect_utils._connect(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib64/python3.11/site-packages/asyncpg/connect_utils.py", line 881, in _connect
return await _connect_addr(
^^^^^^^^^^^^^^^^^^^^
File "/venv/lib64/python3.11/site-packages/asyncpg/connect_utils.py", line 773, in _connect_addr
return await __connect_addr(params, timeout, True, *args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib64/python3.11/site-packages/asyncpg/connect_utils.py", line 831, in __connect_addr
await compat.wait_for(connected, timeout=timeout)
File "/venv/lib64/python3.11/site-packages/asyncpg/compat.py", line 56, in wait_for
return await asyncio.wait_for(fut, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.11/asyncio/tasks.py", line 479, in wait_for
return fut.result()
^^^^^^^^^^^^
asyncpg.exceptions.ConnectionDoesNotExistError: connection was closed in the middle of operation
Looks like some other people have encountered similar problems with asyncpg: https://github.com/MagicStack/asyncpg/issues/309
What's interesting about your traceback is you're making a direct connection to Postgres, and not acquiring a connection from a pool. With a pool I can imagine many potential causes (e.g. a connection being killed off for being too old), but as it's a fresh connection it's quite strange.
Is there any chance that your app has exhausted all available connections to Postgres? By default max_connections is 100. You would expect a different exception than ConnectionDoesNotExistError though.
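If you want to check, something along these lines will show how close you are to the limit (just a rough sketch; adjust the DSN for your setup):

```python
import asyncio
import asyncpg


async def check_connection_usage():
    # Hypothetical DSN - replace with your own credentials.
    conn = await asyncpg.connect("postgresql://user:pass@localhost:5432/mydb")
    try:
        used = await conn.fetchval("SELECT count(*) FROM pg_stat_activity")
        limit = await conn.fetchval("SELECT current_setting('max_connections')")
        print(f"{used} connections in use, max_connections = {limit}")
    finally:
        await conn.close()


asyncio.run(check_connection_usage())
```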
Is there any chance that your app has exhausted all connections to Postgres?
I don't think so, it just randomly seems to happen every now and then.
Do you think it would be feasible to handle this in this library? So that in get_new_connection we somehow restart (?) asyncpg if we get ConnectionDoesNotExistError? Currently I still need to restart the server manually.
I think adding it to this library would have the benefit that not everyone needs to write their own workaround for this bug, which could potentially affect everyone(?) using the Postgres engine, even if only rarely.
What do you think?
Do you think it would be feasible to handle this in this library? So that in get_new_connection we somehow restart (?) asyncpg if we get ConnectionDoesNotExistError? Currently I still need to restart the server manually.
I'm not sure. We could try fetching another connection after a short asyncio.sleep.
Restarting the server will kill all existing connections. I'm not sure if we could do it in a graceful way.
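Something along these lines, perhaps (just a sketch of the idea, not what the engine currently does; the retry count and delay are arbitrary):

```python
import asyncio

import asyncpg
from asyncpg.exceptions import ConnectionDoesNotExistError


async def get_new_connection_with_retry(config: dict, retries: int = 3, delay: float = 0.5):
    """Wrap asyncpg.connect and retry after a short sleep if the
    connection attempt fails with ConnectionDoesNotExistError."""
    for attempt in range(retries):
        try:
            return await asyncpg.connect(**config)
        except ConnectionDoesNotExistError:
            if attempt == retries - 1:
                raise
            # Back off briefly before trying again.
            await asyncio.sleep(delay)
```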
What kind of stack are you running? I read through this thread again: https://github.com/MagicStack/asyncpg/issues/309 and it seems like the following could be true:
- uvloop is somehow to blame
- The Postgres server is under-resourced
- Some kind of network middleware is killing connections
Are you running any real long transactions?
I had a quick look through the Sentry logs for some of my apps, and I've seen TimeoutError when trying to get a connection, but not much ConnectionDoesNotExistError.
What kind of stack are you running?
I do indeed use uvloop. I don't use Docker; the Starlette app and Postgres both run "natively" on the system.
In the postgres log it says:
LOG: could not receive data from client: Connection reset by peer
Are you running any real long transactions?
I don't think so, max should be ~1-2s
What I wonder about: the ConnectionDoesNotExistError seems to me to indicate that some kind of "connection reuse" is happening on some level. Maybe it is possible to force asyncpg to really use a new connection? Currently get_new_connection uses asyncpg.connect. Maybe the stored connections are available via some private attribute, which we could clear as a workaround?
I do indeed use uvloop. I don't use Docker; the Starlette app and Postgres both run "natively" on the system.
So the stack isn't likely to be causing the problem. It could be uvloop ... but kind of hard to test if the error only occurs sporadically.
I don't think so, max should be ~1-2s
Transactions don't sound like the problem either.
The ConnectionDoesNotExistError to me seems to indicate that a "connection reuse" is happening on some level.
Or that the connection is killed part way through an operation.
Maybe the stored connections are available via some private attribute, which we could clear as a workaround?
I would need to look more into the internals of asyncpg.
At what point does the error get raised? Immediately after get_new_connection, or part way through a query?
What is interesting is that this exception, once it has been raised a single time, seems to be raised for all subsequent queries. In the Postgres log, however, the line Connection reset by peer only shows up once.
And here someone was able to solve this problem by changing the keepalive settings of TCP connections. These are clues which I think indicate that there could be a faulty/closed connection in some sort of cache - but maybe it has other causes.
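If the keepalive theory holds, I guess the Postgres keepalive settings could be passed through asyncpg's server_settings as a test. A rough sketch (the values are just examples, and this only adjusts the server side of the connection):

```python
import asyncpg


async def connect_with_keepalive(config: dict):
    # tcp_keepalives_* are standard Postgres settings; the values below
    # are arbitrary examples, not recommendations.
    return await asyncpg.connect(
        **config,
        server_settings={
            "tcp_keepalives_idle": "60",      # seconds of idle before the first probe
            "tcp_keepalives_interval": "10",  # seconds between probes
            "tcp_keepalives_count": "5",      # failed probes before dropping the connection
        },
    )
```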
Immediately after get_new_connection, or part way through a query?
I think it happens during get_new_connection, when trying to connect to Postgres. The first stack trace looks the same as the ones following it.
but kind of hard to test if the error only occurs sporadically
Today I tested:
1. Cancel a task that does a DB operation (insert) after different delays (sketched below)
2. Do many inserts concurrently
3. Do a simple DB operation (insert) after increasing delays (15s to >1h)
Sadly, none of these were able to reproduce the issue. With 1 I sometimes got asyncio.exceptions.InvalidStateError, and with 2 I got asyncpg.exceptions.TooManyConnectionsError.
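Test 1 looked roughly like this (a simplified sketch; MyTable stands in for one of my Piccolo tables, and the delay was varied between runs):

```python
import asyncio

from my_app.tables import MyTable  # hypothetical Piccolo table


async def cancel_insert_after(delay: float):
    # Start an insert, then cancel the task part way through.
    task = asyncio.create_task(MyTable.insert(MyTable(name="test")).run())
    await asyncio.sleep(delay)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
```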
However, I realized that my app wasn't using connection pooling at all, since this has to be enabled manually. As the ConnectionDoesNotExistError seems to have been thrown when creating new connections, I hope that enabling the pool also reduces the overall chance of this bug occurring, since far fewer new connections should need to be created now.
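For reference, enabling the pool in the Starlette app looks roughly like this (following the Piccolo docs; engine_finder assumes a piccolo_conf.py is importable, and the routes are omitted):

```python
from piccolo.engine import engine_finder
from starlette.applications import Starlette


async def open_database_connection_pool():
    engine = engine_finder()
    await engine.start_connection_pool()


async def close_database_connection_pool():
    engine = engine_finder()
    await engine.close_connection_pool()


app = Starlette(
    routes=[],  # add your routes here
    on_startup=[open_database_connection_pool],
    on_shutdown=[close_database_connection_pool],
)
```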