
Can named queues become permanently locked when a job reaches max_attempts?

Open • gregberns opened this issue 2 years ago • 6 comments

Summary

We're heavily using 'Named Queues' (setting the queue_name column in the jobs table) now and seeing issues with a job becoming locked and never getting unlocked. This blocks the whole queue.

Steps to reproduce

Add many jobs to the system with the graphile_worker.add_job function, and spread the jobs across many queue_names. Example: {JobName-0} - {JobName-n}, where n is maybe 20-30 (to push the system); see the sketch below.

Run the system until you see a job that is 1) locked and 2) has attempts = max_attempts.
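For illustration, the enqueue step looks roughly like this (a minimal sketch, not our actual code; the task name and payload are placeholders, and it assumes the add_job parameter names from this version of graphile-worker, where queue_name can be passed as a named argument):

  -- Hypothetical sketch: spread one job per named queue, JobName-0 .. JobName-19.
  -- 'JobName' and the payload are placeholders.
  SELECT graphile_worker.add_job(
    'JobName',
    payload    := json_build_object('iteration', i),
    queue_name := 'JobName-' || i
  )
  FROM generate_series(0, 19) AS i;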

Expected results

Each queue should process its jobs normally: jobs should get locked, processed, and unlocked as expected.

Actual results

Out of the n queues, one will become locked. The pattern we're seeing is that the queue gets locked and:

  • created_at = run_at - This means the issue is occurring on the first run
  • attempts = max_attempts - so 25 = 25 if using the default
  • The job never gets unlocked

This MIGHT be the error that's occurring, but it's not terribly helpful beyond indicating a Postgres connection error.

Error: Connection terminated unexpectedly
    at Connection.<anonymous> (/app/node_modules/pg/lib/client.js:132:73)
    at Object.onceWrapper (node:events:509:28)
    at Connection.emit (node:events:390:28)
    at Connection.emit (node:domain:475:12)
    at TLSSocket.<anonymous> (/app/node_modules/pg/lib/connection.js:107:12)
    at TLSSocket.emit (node:events:402:35)
    at TLSSocket.emit (node:domain:475:12)
    at endReadableNT (node:internal/streams/readable:1343:12)
    at processTicksAndRejections (node:internal/process/task_queues:83:21)
Emitted 'error' event on Client instance at:
    at Client._handleErrorEvent (/app/node_modules/pg/lib/client.js:319:10)
    at Connection.<anonymous> (/app/node_modules/pg/lib/client.js:149:16)
    at Object.onceWrapper (node:events:509:28)
    [... lines matching original stack trace ...]
    at processTicksAndRejections (node:internal/process/task_queues:83:21)
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
error Command failed.
Exit code: 1
Command: /usr/local/bin/node
Arguments: /opt/yarn-v1.22.15/lib/cli.js start
Directory: /app/packages/worker
Output:

Additional context

I restarted the whole system at 19:02 (cleared the tables and restarted 16 worker processes). At 19:04 the first locking issue occurred.

Also, and this may be a coincidence, but we just noticed that:

  • Instance1's UpdatedAt = Instance2's CreatedAt
  • Instance2's UpdatedAt = Instance3's CreatedAt

I also searched Pod logs for worker-cec1a1fada9a164606 (lock from Instance #1) and nothing showed up - which signals the process died. Also, out of the 16 worker processes, there were 6 that had restarted (7 total restarts).

INSTANCE #1

  {
    "id": 217329284,
    "queue_name": "MatchingFillShift-4",
    "task_identifier": "MatchingFillShift",
    "payload": "{\"shift_id\":\"194369\",\"invites_to_send\":1}",
    "priority": 0,
    "run_at": "2022-03-15 19:04:16.385935+00",
    "attempts": 10,
    "max_attempts": 10,
    "last_error": null,
    "created_at": "2022-03-15 19:04:16.385935+00",
    "updated_at": "2022-03-15 19:34:10.955446+00",
    "key": null,
    "locked_at": "2022-03-15 19:04:47.846168+00",
    "locked_by": "worker-cec1a1fada9a164606",
    "revision": 0,
    "flags": null
  } 

INSTANCE #2


  {
    "id": 217333027,
    "queue_name": "MatchingFillShift-3",
    "task_identifier": "MatchingFillShift",
    "payload": "{\"shift_id\":\"194528\",\"invites_to_send\":1}",
    "priority": 0,
    "run_at": "2022-03-15 19:34:10.955446+00",
    "attempts": 10,
    "max_attempts": 10,
    "last_error": null,
    "created_at": "2022-03-15 19:34:10.955446+00",
    "updated_at": "2022-03-15 20:04:16.834968+00",
    "key": null,
    "locked_at": "2022-03-15 19:34:42.018201+00",
    "locked_by": "worker-9a5ded96660316f8bc",
    "revision": 0,
    "flags": null
  }  

INSTANCE #3

  {
    "id": 217337268,
    "queue_name": "MatchingFillShift-8",
    "task_identifier": "MatchingFillShift",
    "payload": "{\"shift_id\":\"194242\",\"invites_to_send\":1}",
    "priority": 0,
    "run_at": "2022-03-15 20:04:16.834968+00",
    "attempts": 10,
    "max_attempts": 10,
    "last_error": null,
    "created_at": "2022-03-15 20:04:16.834968+00",
    "updated_at": "2022-03-15 20:34:17.079496+00",
    "key": null,
    "locked_at": "2022-03-15 20:04:48.49702+00",
    "locked_by": "worker-ce998a23f744072071",
    "revision": 0,
    "flags": null
  }  
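As a side note, the lock from Instance #1 is held by worker-cec1a1fada9a164606, which no longer exists. If it never clears on its own, a hypothetical one-off cleanup (assuming the default graphile_worker schema, where both jobs and job_queues carry locked_at/locked_by) would be something like:

  -- Hypothetical manual cleanup: release locks held by a worker process that
  -- has died. The named queue stays blocked until both rows are cleared.
  UPDATE graphile_worker.jobs
  SET locked_at = NULL, locked_by = NULL
  WHERE locked_by = 'worker-cec1a1fada9a164606';

  UPDATE graphile_worker.job_queues
  SET locked_at = NULL, locked_by = NULL
  WHERE locked_by = 'worker-cec1a1fada9a164606';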

Possible Solution

From Benjie

There’s probably a bug if the system crashes on the max attempt, because I think max attempts is bumped first, so it’s never reattempted, but it’s still locked. Please file 👍 Specifically, the system has to crash, which is why I’ve not seen it. (*Crash or be killed)
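A simplified illustration of that ordering (a sketch only, not graphile-worker's actual get_job/fail_job SQL):

  -- Sketch: checking out a job bumps attempts and takes the lock in one step,
  -- roughly like this.
  UPDATE graphile_worker.jobs
  SET attempts  = attempts + 1,
      locked_at = now(),
      locked_by = 'worker-cec1a1fada9a164606'
  WHERE id = 217329284;

  -- complete_job / fail_job would normally clear locked_at afterwards, but if
  -- the worker process dies first the lock survives. Later get_job calls skip
  -- jobs whose attempts have reached max_attempts, so a job that crashed on
  -- its final attempt is never re-selected and never unlocked.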

gregberns commented Mar 15 '22 22:03

There's a number of things we need to investigate here:

  1. max attempts is 10 even though updated_at is only 30 minutes (almost exactly) after created_at - how are locked jobs being reattempted so quickly?
  2. run_at is the created_at time - it seems that, though attempts was bumped, run_at was not - we should fix this, maybe by bumping run_at on get_job (when we bump attempts) and also on fail_job to accommodate the delay from executing the job.
  3. locked_at is the created_at timestamp, which means that get_job should be refusing to pick it up at all, and yet apparently it is not.
  4. if a job hits its max attempts and the process exits prematurely, then a) it will be locked, and b) it will never be re-selected because its max_attempts has been reached -> it stays locked forever
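Jobs stuck in the state from point 4 should be identifiable with a query along these lines (a minimal sketch, assuming the default graphile_worker schema and the 4-hour stale-lock interval):

  -- Minimal sketch: list jobs that are still locked but can never be
  -- re-selected because attempts has already reached max_attempts.
  SELECT id, queue_name, attempts, max_attempts, locked_by, locked_at
  FROM graphile_worker.jobs
  WHERE locked_at IS NOT NULL
    AND locked_at < now() - interval '4 hours'  -- the lock is long stale
    AND attempts >= max_attempts;               -- ...yet the job will never be retried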

Before we dig any further... Are you using a custom get_jobs? Or are you updating the jobs/job queues to null out the locked_at or locked_by columns or similar?

benjie commented Mar 16 '22 09:03

  1. No idea how max_attempts is 10. I suspect that it's 1 when the job is started, then something else 30 mins later is changing it, but no idea what that would be.

Are you using a custom get_jobs?

Not to my knowledge

Or are you updating the jobs/job queues to null out the locked_at or locked_by columns or similar?

Not to my knowledge

I'm going to write a very generic test harness to try and reproduce without all our other stuff involved.

gregberns commented Mar 16 '22 22:03

I think they were added with max_attempts set to 10

johnhaley81 commented Mar 16 '22 23:03

Sorry I meant attempts not max attempts. It shouldn’t be able to reach 10 for at least 9 hours; 30 mins is way too fast.

benjie commented Mar 16 '22 23:03

We figured out what was happening, and it was unrelated to GraphileWorker. We were creating another pool and didn't have handlers correctly hooked up.

gregberns commented Mar 20 '22 06:03

I'm reopening this because I want to investigate the things above anyway to make sure they're not issues.

benjie commented Mar 22 '22 10:03