LISTEN/NOTIFY sometimes drops without recovery
Summary
We have been using Graphile Worker in production for a while now, more specifically the build from this PR (#474). It works remarkably well. However, every few months the LISTEN/NOTIFY connection seems to drop.
Our logs show ECONNREFUSED at about the time this starts.
ERROR: Failed during pool sweep (migrationNumber=19): Error: connect ECONNREFUSED
...
ERROR: Failed to update heartbeat for pool pool-7699a0aba218dd3402: Error: connect ECONNREFUSED
After that, it is pretty obvious from the log frequency that we are only polling every few minutes and no longer using LISTEN/NOTIFY.
The "fix" is to simply restart.
Steps to reproduce
Not really sure; it just happens irregularly. Sorry!
Expected results
The worker should recover and resume using LISTEN/NOTIFY.
Actual results
LISTEN/NOTIFY never recovers.
Additional context
Postgres v15, Node 20
Hey @benjie, we saw this again just now during a large CPU spike, and I believe it happens when the database goes into recovery mode.
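For what it's worth, here is a rough after-the-fact check we can run to correlate the incident with a restart or recovery. Just a sketch, assuming node-postgres; `pg_is_in_recovery()` and `pg_postmaster_start_time()` are standard Postgres functions:

```ts
// Check whether Postgres is currently replaying WAL and when it last started
// (a restart around the incident time suggests crash recovery / failover).
import { Pool } from "pg";

async function checkRecoveryStatus(connectionString: string): Promise<void> {
  const pool = new Pool({ connectionString });
  const { rows } = await pool.query(
    `SELECT pg_is_in_recovery() AS in_recovery,
            pg_postmaster_start_time() AS started_at`
  );
  console.log(rows[0]); // e.g. { in_recovery: false, started_at: <timestamp> }
  await pool.end();
}
```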
Ah, that would make sense! Thanks for following up. Can you confirm the specific version of Graphile Worker you are using?
Ah, you already said you're using the canary. Okay, that should be enough for me to try to reproduce it 👍
I cannot reproduce this, and tracing the code I see no reason why LISTEN/NOTIFY would stop working - it should go offline for at most 90 seconds.
Thanks for checking! We had this again last night. The logs are:
[core] ERROR: Failed to update heartbeat for pool pool-65fe4ebebafe633919: Error: connect ECONNREFUSED 3.75.38.74:5432
[core] ERROR: Failed during pool sweep (migrationNumber=19): error: the database system is not accepting connections
Those logs are harmless informational messages. I've just read through the related source code; they're part of the error recovery mechanisms and should not interfere with LISTEN/NOTIFY. 🤔
Maybe we've misdiagnosed the issue and LISTEN/NOTIFY is still working, but there's no "free" worker to do the work? Without a reproducible example it's incredibly tough to track this down.
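If you hit it again, something like this might help distinguish the two cases. A sketch, assuming node-postgres; the channel name is hypothetical, so check what your worker actually LISTENs on, and note that seeing other sessions' query text in pg_stat_activity requires sufficient privileges:

```ts
// Rough diagnostic: list sessions whose last statement was a LISTEN, then send a
// manual NOTIFY and watch whether a job is picked up promptly. If it is, LISTEN/NOTIFY
// is fine and the delay is elsewhere (e.g. all workers busy).
import { Pool } from "pg";

async function diagnose(connectionString: string): Promise<void> {
  const pool = new Pool({ connectionString });

  // Heuristic: pg_stat_activity only exposes a session's last query text, which for a
  // dedicated listener connection stays "LISTEN ...".
  const listeners = await pool.query(
    `SELECT pid, application_name, state, query
       FROM pg_stat_activity
      WHERE query ILIKE 'listen %'`
  );
  console.log("apparent LISTEN sessions:", listeners.rows);

  // Hypothetical channel name; the payload may be ignored or rejected by the consumer.
  await pool.query(`NOTIFY "jobs:insert", 'manual test'`);

  await pool.end();
}
```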
Still cannot reproduce this, but I've released v0.17.0-rc.0, which adds more logging 🤷‍♂️