Can named queues become permanently locked when a job reaches max_attempts?
Summary
We're heavily using 'Named Queues' (setting the `queue_name` column in the `jobs` table) and are now seeing issues with a job becoming locked and never getting unlocked. This blocks the whole queue.
Steps to reproduce
Add many jobs to the system with the `graphile_worker.add_job` function, spreading the jobs across many `queue_name`s. Example: {JobName-0} to {JobName-N}, where N is maybe 20-30 (to push the system); a sketch of such a call follows these steps.
Run the system until you see a job that is (1) locked and (2) has attempts = max_attempts.
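For concreteness, something like the following is roughly how the jobs get spread across queues. The task identifier and payload shape mirror the instances dumped below, but the queue count and job count here are illustrative placeholders:

```sql
-- Illustrative only: enqueue many jobs spread across ~20 named queues
-- so that each queue serialises its own jobs.
SELECT graphile_worker.add_job(
  'MatchingFillShift',                                    -- task_identifier
  json_build_object('shift_id', s::text, 'invites_to_send', 1),
  queue_name := 'MatchingFillShift-' || (s % 20)::text    -- JobName-0 .. JobName-19
)
FROM generate_series(1, 1000) AS s;
```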
Expected results
Each queue should process its jobs normally: jobs should get locked, processed, and unlocked as expected.
Actual results
Out of N queues, one will become locked. The pattern we're seeing is that the queue gets locked and:
- created_at = run_at, which means the issue is occurring on the first run
- attempts = max_attempts (so 25 = 25 if using the default)
- the job never gets unlocked (a query to surface this state is sketched after this list)
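For reference, a query along these lines (against the `graphile_worker.jobs` table) is one way to surface jobs stuck in that state:

```sql
-- Find jobs that look permanently stuck: still locked, already at their
-- attempt limit, and with run_at never having moved past created_at.
SELECT id, queue_name, task_identifier, attempts, max_attempts,
       created_at, run_at, locked_at, locked_by
FROM graphile_worker.jobs
WHERE locked_at IS NOT NULL
  AND attempts >= max_attempts
  AND run_at = created_at;
```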
This MIGHT be the error that's occurring, but it's not terribly helpful beyond indicating that a Postgres connection was terminated unexpectedly.
Error: Connection terminated unexpectedly
at Connection.<anonymous> (/app/node_modules/pg/lib/client.js:132:73)
at Object.onceWrapper (node:events:509:28)
at Connection.emit (node:events:390:28)
at Connection.emit (node:domain:475:12)
at TLSSocket.<anonymous> (/app/node_modules/pg/lib/connection.js:107:12)
at TLSSocket.emit (node:events:402:35)
at TLSSocket.emit (node:domain:475:12)
at endReadableNT (node:internal/streams/readable:1343:12)
at processTicksAndRejections (node:internal/process/task_queues:83:21)
Emitted 'error' event on Client instance at:
at Client._handleErrorEvent (/app/node_modules/pg/lib/client.js:319:10)
at Connection.<anonymous> (/app/node_modules/pg/lib/client.js:149:16)
at Object.onceWrapper (node:events:509:28)
[... lines matching original stack trace ...]
at processTicksAndRejections (node:internal/process/task_queues:83:21)
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
error Command failed.
Exit code: 1
Command: /usr/local/bin/node
Arguments: /opt/yarn-v1.22.15/lib/cli.js start
Directory: /app/packages/worker
Output:
Additional context
I restarted the whole system at 19:02 (cleared the tables and restarted 16 worker processes). At 19:04 the first locking issue occurred.
Also (this may be a coincidence), we just noticed that:
- Instance1's UpdatedAt = Instance2's CreatedAt
- Instance2's UpdatedAt = Instance3's CreatedAt
I also searched the Pod logs for worker-cec1a1fada9a164606 (the locked_by from Instance #1) and nothing showed up, which suggests the process died. Also, out of the 16 worker processes, 6 had restarted (7 restarts in total).
INSTANCE #1
{
"id": 217329284,
"queue_name": "MatchingFillShift-4",
"task_identifier": "MatchingFillShift",
"payload": "{\"shift_id\":\"194369\",\"invites_to_send\":1}",
"priority": 0,
"run_at": "2022-03-15 19:04:16.385935+00",
"attempts": 10,
"max_attempts": 10,
"last_error": null,
"created_at": "2022-03-15 19:04:16.385935+00",
"updated_at": "2022-03-15 19:34:10.955446+00",
"key": null,
"locked_at": "2022-03-15 19:04:47.846168+00",
"locked_by": "worker-cec1a1fada9a164606",
"revision": 0,
"flags": null
}
INSTANCE #2
{
"id": 217333027,
"queue_name": "MatchingFillShift-3",
"task_identifier": "MatchingFillShift",
"payload": "{\"shift_id\":\"194528\",\"invites_to_send\":1}",
"priority": 0,
"run_at": "2022-03-15 19:34:10.955446+00",
"attempts": 10,
"max_attempts": 10,
"last_error": null,
"created_at": "2022-03-15 19:34:10.955446+00",
"updated_at": "2022-03-15 20:04:16.834968+00",
"key": null,
"locked_at": "2022-03-15 19:34:42.018201+00",
"locked_by": "worker-9a5ded96660316f8bc",
"revision": 0,
"flags": null
}
INSTANCE #3
{
"id": 217337268,
"queue_name": "MatchingFillShift-8",
"task_identifier": "MatchingFillShift",
"payload": "{\"shift_id\":\"194242\",\"invites_to_send\":1}",
"priority": 0,
"run_at": "2022-03-15 20:04:16.834968+00",
"attempts": 10,
"max_attempts": 10,
"last_error": null,
"created_at": "2022-03-15 20:04:16.834968+00",
"updated_at": "2022-03-15 20:34:17.079496+00",
"key": null,
"locked_at": "2022-03-15 20:04:48.49702+00",
"locked_by": "worker-ce998a23f744072071",
"revision": 0,
"flags": null
}
Possible Solution
From Benjie
There’s probably a bug if the system crashes on the max attempt, because I think attempts is bumped first, so it’s never reattempted, but it’s still locked. Please file 👍. Specifically the system has to crash (or be killed), which is why I’ve not seen it.
There are a number of things we need to investigate here:
- max attempts is 10 even though `updated_at` is only 30 minutes (almost exactly) after `created_at`
- how are locked jobs being reattempted so quickly?
- `run_at` is the `created_at` time - it seems that though `attempts` was bumped, `run_at` was not - we should fix this, maybe by bumping `run_at` on `get_job` (when we bump attempts) and also on `fail_job` to accommodate the delay from executing the job.
- `locked_at` is the `created_at` timestamp, which means that `get_job` should be refusing to pick it up at all, and yet apparently it is not.
- if a job hits its max attempts and the process exits prematurely then (a) it will be locked, (b) it will never be re-selected because its `max_attempts` has been reached -> it stays locked forever (a manual cleanup sketch follows this list)
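Until that last case is fixed, a manual cleanup along these lines is one way to release the stale locks left by a dead worker. This is a sketch, not an official API: the 4-hour cutoff is an assumption mirroring the usual stale-lock window, and whether the `job_queues` row also needs clearing depends on your situation.

```sql
-- Release jobs that hit their attempt limit while locked by a worker
-- that has since died (cutoff chosen to avoid touching live locks).
UPDATE graphile_worker.jobs
SET locked_at = NULL, locked_by = NULL
WHERE locked_at < now() - interval '4 hours'
  AND attempts >= max_attempts;

-- If the whole named queue is wedged, its job_queues row may also still
-- be locked by the dead worker:
UPDATE graphile_worker.job_queues
SET locked_at = NULL, locked_by = NULL
WHERE locked_at < now() - interval '4 hours';
```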
Before we dig any further... Are you using a custom `get_jobs`? Or are you updating the jobs/job queues to null out the `locked_at` or `locked_by` columns or similar?
- No idea how max_attempts is 10. I suspect that it's 1 when the job is started, then something else 30 mins later is changing it, but no idea what that would be.
Are you using a custom get_jobs?
Not to my knowledge
Or are you updating the jobs/job queues to null out the locked_at or locked_by columns or similar?
Not to my knowledge
I'm going to write a very generic test harness to try and reproduce without all our other stuff involved.
I think they were added with max_attempts set to 10.
Sorry, I meant attempts, not max attempts. It shouldn’t be able to reach 10 for at least 9 hours; 30 mins is way too fast.
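For context on the "9 hours" figure: my understanding is that the retry backoff is roughly exp(attempts) seconds per failure (capped around exp(10)), so reaching attempt 10 should take about the sum of e^1 through e^10 seconds. A quick sanity check under that assumption:

```sql
-- Assumes a delay of ~exp(n) seconds after the n-th failed attempt.
SELECT round(sum(exp(least(n, 10)))::numeric / 3600, 1) AS approx_hours
FROM generate_series(1, 10) AS n;
-- => roughly 9.7 hours before attempt 10 could legitimately occur
```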
We figured out what was happening, and it was unrelated to Graphile Worker. We were creating another pool and didn't have its handlers correctly hooked up.
I'm reopening this because I want to investigate the things above anyway to make sure they're not issues.