nl-kat-coordination icon indicating copy to clipboard operation
nl-kat-coordination copied to clipboard

KATalogus reaches unworkable state when postgres session is closed due to connection issue

Open underdarknl opened this issue 11 months ago • 2 comments

Describe the bug

Apr 09 10:24:59 boefjes.acc boefjes-boefje[30343]: sqlalchemy.exc.PendingRollbackError: Can't reconnect until invalid transaction is rolled back. (Background on this error at: https://sqlalche.me/e/14/8s2b)

De katalogus reaches a stuck state when (possibly) the database server (clustered, reachable via a load balancer) disconnects the session mid-tansaction.

Restarting the katalogus resolves the issue by clearing the pending rollback.

To Reproduce Steps to reproduce the behavior:

  1. Have some activity on katalogus
  2. Restart the database, hopefully you hit it somehwere in the middle of a transaction.
  3. Katalogus tries to do a rollback , im guessing in https://github.com/minvws/nl-kat-coordination/blob/main/boefjes/boefjes/sql/session.py#L31
  4. Notice the rollback cannot be performed, as no reconnect has taken place, blocking any subsequent action on the session.

Expected behavior Reconnect to commit the original transaction, or if not possibly, clear the rollback if the session cannot be recovered.

OpenKAT version All

Additional context

https://docs.sqlalchemy.org/en/20/core/pooling.html#dealing-with-disconnects

underdarknl avatar Apr 10 '25 07:04 underdarknl

Notes:

For the boefjes this makes sense as we do not properly rollback on errors. However in the KATalogus we consistently get an OperationalError on each request. Since we have pool_pre_ping=True and the documentation of SQLAlchemy states:

The above example illustrates that no special intervention is needed to refresh the pool, which continues normally after a disconnection event is detected. However, one database exception is raised, per each connection that is in use while the database unavailability event occurred. In a typical web application using an ORM Session, the above condition would correspond to a single request failing with a 500 error, then the web application continuing normally beyond that. Hence the approach is “optimistic” in that frequent database restarts are not anticipated.

It is not clear to me why this persists until we restart the application yet.

Donnype avatar Apr 10 '25 09:04 Donnype

For the boefje runner I think that it would also make more sense if it would use the katalogus API instead of directly connecting to the database.

dekkers avatar Apr 10 '25 10:04 dekkers

For the boefje runner I think that it would also make more sense if it would use the katalogus API instead of directly connecting to the database.

I agree, the database might not be in the same network, and keeping the type of transactions between our local and remote (boefjerunner) services limited to http(s) only makes networking / firewalling easier.

underdarknl avatar Apr 14 '25 19:04 underdarknl