KATalogus reaches unworkable state when postgres session is closed due to connection issue
Describe the bug
Apr 09 10:24:59 boefjes.acc boefjes-boefje[30343]: sqlalchemy.exc.PendingRollbackError: Can't reconnect until invalid transaction is rolled back. (Background on this error at: https://sqlalche.me/e/14/8s2b)
De katalogus reaches a stuck state when (possibly) the database server (clustered, reachable via a load balancer) disconnects the session mid-tansaction.
Restarting the katalogus resolves the issue by clearing the pending rollback.
To Reproduce Steps to reproduce the behavior:
- Have some activity on katalogus
- Restart the database, hopefully you hit it somehwere in the middle of a transaction.
- Katalogus tries to do a rollback , im guessing in https://github.com/minvws/nl-kat-coordination/blob/main/boefjes/boefjes/sql/session.py#L31
- Notice the rollback cannot be performed, as no reconnect has taken place, blocking any subsequent action on the session.
Expected behavior Reconnect to commit the original transaction, or if not possibly, clear the rollback if the session cannot be recovered.
OpenKAT version All
Additional context
https://docs.sqlalchemy.org/en/20/core/pooling.html#dealing-with-disconnects
Notes:
For the boefjes this makes sense as we do not properly rollback on errors. However in the KATalogus we consistently get an OperationalError on each request. Since we have pool_pre_ping=True and the documentation of SQLAlchemy states:
The above example illustrates that no special intervention is needed to refresh the pool, which continues normally after a disconnection event is detected. However, one database exception is raised, per each connection that is in use while the database unavailability event occurred. In a typical web application using an ORM Session, the above condition would correspond to a single request failing with a 500 error, then the web application continuing normally beyond that. Hence the approach is “optimistic” in that frequent database restarts are not anticipated.
It is not clear to me why this persists until we restart the application yet.
For the boefje runner I think that it would also make more sense if it would use the katalogus API instead of directly connecting to the database.
For the boefje runner I think that it would also make more sense if it would use the katalogus API instead of directly connecting to the database.
I agree, the database might not be in the same network, and keeping the type of transactions between our local and remote (boefjerunner) services limited to http(s) only makes networking / firewalling easier.