
LISTEN/NOTIFY sometimes drops without recovery

Open psteinroe opened this issue 7 months ago • 7 comments

Summary

We have been using Graphile Worker in production for a while now, more specifically this PR (#474). It works remarkably well. However, every few months the LISTEN/NOTIFY connection seems to drop.

Our logs show ECONNREFUSED at about the time this starts.

ERROR: Failed during pool sweep (migrationNumber=19): Error: connect ECONNREFUSED
...
ERROR: Failed to update heartbeat for pool pool-7699a0aba218dd3402: Error: connect ECONNREFUSED

After that, the log frequency makes it pretty obvious that we are only polling every few minutes and no longer using LISTEN/NOTIFY.
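One rough way to make the "polling only" observation measurable is to look at job pick-up latency rather than eyeballing log frequency. The sketch below is an illustration, not anything Graphile Worker provides: `looksLikePollingOnly`, its parameters, and the 25% threshold are all assumptions. It assumes you can collect, per job, the milliseconds between the job being queued and being started.

```typescript
// Hypothetical helper names; not part of Graphile Worker.

/** Median of a non-empty list of numbers. */
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

/**
 * Heuristic: when notifications arrive, jobs tend to start shortly after
 * being queued. When only polling is active, latencies spread out across
 * the poll interval. A median latency above a fraction of the poll
 * interval (assumed default: 25%) is treated as a hint of polling-only.
 */
function looksLikePollingOnly(
  latenciesMs: number[],
  pollIntervalMs: number,
  thresholdFraction = 0.25,
): boolean {
  if (latenciesMs.length === 0) return false; // no data, no verdict
  return median(latenciesMs) >= pollIntervalMs * thresholdFraction;
}
```

High latency can also mean that every worker was busy, so a positive result is only a hint that the listener is gone, not proof.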

(upload not working, retrying in a few minutes)

The "fix" is to simply restart.

Steps to reproduce

Not really sure, just happens irregularly. Sorry!

Expected results

A worker should recover.
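To make "recover" concrete, here is a minimal sketch of the kind of behaviour expected: keep retrying the listener with a capped exponential backoff until it connects again. This is not Graphile Worker's actual reconnect logic. `backoffDelayMs`, `listenWithRetry`, the `startListening` callback, and the delay values are all hypothetical.

```typescript
// Illustrative sketch only: all names here are hypothetical.

/** Delay before retry number `attempt` (0-based): doubles each time, capped at maxMs. */
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

/**
 * Calls `startListening` until it succeeds or `maxAttempts` is reached.
 * Resolves with the number of attempts used; rejects with the last error.
 * `sleep` is injectable so the loop can be exercised without real timers.
 */
async function listenWithRetry(
  startListening: () => Promise<void>,
  maxAttempts: number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<number> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await startListening();
      return attempt + 1;
    } catch (error) {
      lastError = error; // e.g. ECONNREFUSED while the database restarts
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

In a real worker, `startListening` would have to open a fresh connection and issue `LISTEN` again, because Postgres drops a session's listen registrations when the connection ends. A long-running worker would probably also retry indefinitely rather than give up after `maxAttempts`.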

Actual results

LISTEN/NOTIFY never recovers.

Additional context

Postgres v15, Node 20

psteinroe avatar Apr 29 '25 07:04 psteinroe

Hey @benjie, we just saw this again during a large CPU spike, and I believe it happens when the db goes into recovery mode.

psteinroe avatar Jun 06 '25 14:06 psteinroe

Ah, that would make sense! Thanks for following up. Can you confirm the specific version of Graphile Worker you are using?

benjie avatar Jun 07 '25 15:06 benjie

Ah you said already you're using the canary. Okay that should be enough for me to try and reproduce it 👍

benjie avatar Jun 07 '25 15:06 benjie

I cannot reproduce this, and tracing the code I see no reason why LISTEN/NOTIFY would stop working - it should go offline for at most 90 seconds.

benjie avatar Jun 11 '25 09:06 benjie

Thanks for checking! We had this again last night. The logs are:

[core] ERROR: Failed to update heartbeat for pool pool-65fe4ebebafe633919: Error: connect ECONNREFUSED 3.75.38.74:5432

[core] ERROR: Failed during pool sweep (migrationNumber=19): error: the database system is not accepting connections

psteinroe avatar Jun 21 '25 19:06 psteinroe

Those logs are harmless information. I've just read through their related source code, and they're part of error recovery mechanisms - they should not interfere with LISTEN/NOTIFY. 🤔

Maybe we've misdiagnosed the issue and LISTEN/NOTIFY is still working, but there's no "free" worker to do the work? Without a reproducible example it's incredibly tough to track this down.

benjie avatar Jul 13 '25 10:07 benjie

Still cannot reproduce this, but I've released v0.17.0-rc.0 which adds more logging 🤷‍♂️

benjie avatar Jul 29 '25 15:07 benjie