
Can named queues become permanently locked when a job reaches max_attempts?

Open • gregberns opened this issue 2 years ago • 6 comments

Summary

We're heavily using 'Named Queues' (setting the queue_name column in the jobs table) now and seeing issues with a job becoming locked and never getting unlocked. This blocks the whole queue.

Steps to reproduce

Add many jobs to the system with the graphile_worker.add_job function, and spread the jobs across many queue_names. Example: {JobName-0} - {JobName-n}, where n is maybe 20-30 (to push the system); see the sketch below.

Run the system until you see a job that is 1) locked and 2) has attempts = max_attempts.
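For illustration, the enqueue step looks roughly like this (a minimal sketch, not our actual code; the task name and payload are placeholders, and it assumes the add_job parameter names from this version of graphile-worker, where queue_name can be passed as a named argument):

  -- Hypothetical sketch: spread one job per named queue, JobName-0 .. JobName-19.
  -- 'JobName' and the payload are placeholders.
  SELECT graphile_worker.add_job(
    'JobName',
    payload    := json_build_object('iteration', i),
    queue_name := 'JobName-' || i
  )
  FROM generate_series(0, 19) AS i;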

Expected results

Each queue should process its jobs normally: jobs should get locked, processed, and unlocked as expected.

Actual results

Out of the n queues, one will become locked. The pattern we're seeing is that the queue gets locked and:

  • created_at = run_at - This means the issue is occurring on the first run
  • attempts = max_attempts - so 25 = 25 if using the default
  • The job never gets unlocked

This MIGHT be the error that's occurring, but it's not terribly helpful beyond indicating a Postgres connection error.

Error: Connection terminated unexpectedly
    at Connection.<anonymous> (/app/node_modules/pg/lib/client.js:132:73)
    at Object.onceWrapper (node:events:509:28)
    at Connection.emit (node:events:390:28)
    at Connection.emit (node:domain:475:12)
    at TLSSocket.<anonymous> (/app/node_modules/pg/lib/connection.js:107:12)
    at TLSSocket.emit (node:events:402:35)
    at TLSSocket.emit (node:domain:475:12)
    at endReadableNT (node:internal/streams/readable:1343:12)
    at processTicksAndRejections (node:internal/process/task_queues:83:21)
Emitted 'error' event on Client instance at:
    at Client._handleErrorEvent (/app/node_modules/pg/lib/client.js:319:10)
    at Connection.<anonymous> (/app/node_modules/pg/lib/client.js:149:16)
    at Object.onceWrapper (node:events:509:28)
    [... lines matching original stack trace ...]
    at processTicksAndRejections (node:internal/process/task_queues:83:21)
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
error Command failed.
Exit code: 1
Command: /usr/local/bin/node
Arguments: /opt/yarn-v1.22.15/lib/cli.js start
Directory: /app/packages/worker
Output:

Additional context

I restarted the whole system at 19:02 (cleared the tables and restarted 16 worker processes). At 19:04 the first locking issue occurred.

Also, and this may be a coincidence, but we just noticed that:

  • Instance1's UpdatedAt = Instance2's CreatedAt
  • Instance2's UpdatedAt = Instance3's CreatedAt

I also searched Pod logs for worker-cec1a1fada9a164606 (lock from Instance #1) and nothing showed up - which signals the process died. Also, out of the 16 worker processes, there were 6 that had restarted (7 total restarts).

INSTANCE #1

  {
    "id": 217329284,
    "queue_name": "MatchingFillShift-4",
    "task_identifier": "MatchingFillShift",
    "payload": "{\"shift_id\":\"194369\",\"invites_to_send\":1}",
    "priority": 0,
    "run_at": "2022-03-15 19:04:16.385935+00",
    "attempts": 10,
    "max_attempts": 10,
    "last_error": null,
    "created_at": "2022-03-15 19:04:16.385935+00",
    "updated_at": "2022-03-15 19:34:10.955446+00",
    "key": null,
    "locked_at": "2022-03-15 19:04:47.846168+00",
    "locked_by": "worker-cec1a1fada9a164606",
    "revision": 0,
    "flags": null
  } 

INSTANCE #2


  {
    "id": 217333027,
    "queue_name": "MatchingFillShift-3",
    "task_identifier": "MatchingFillShift",
    "payload": "{\"shift_id\":\"194528\",\"invites_to_send\":1}",
    "priority": 0,
    "run_at": "2022-03-15 19:34:10.955446+00",
    "attempts": 10,
    "max_attempts": 10,
    "last_error": null,
    "created_at": "2022-03-15 19:34:10.955446+00",
    "updated_at": "2022-03-15 20:04:16.834968+00",
    "key": null,
    "locked_at": "2022-03-15 19:34:42.018201+00",
    "locked_by": "worker-9a5ded96660316f8bc",
    "revision": 0,
    "flags": null
  }  

INSTANCE #3

  {
    "id": 217337268,
    "queue_name": "MatchingFillShift-8",
    "task_identifier": "MatchingFillShift",
    "payload": "{\"shift_id\":\"194242\",\"invites_to_send\":1}",
    "priority": 0,
    "run_at": "2022-03-15 20:04:16.834968+00",
    "attempts": 10,
    "max_attempts": 10,
    "last_error": null,
    "created_at": "2022-03-15 20:04:16.834968+00",
    "updated_at": "2022-03-15 20:34:17.079496+00",
    "key": null,
    "locked_at": "2022-03-15 20:04:48.49702+00",
    "locked_by": "worker-ce998a23f744072071",
    "revision": 0,
    "flags": null
  }  
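As a side note, the lock from Instance #1 is held by worker-cec1a1fada9a164606, which no longer exists. If it never clears on its own, a hypothetical one-off cleanup (assuming the default graphile_worker schema, where both jobs and job_queues carry locked_at/locked_by) would be something like:

  -- Hypothetical manual cleanup: release locks held by a worker process that
  -- has died. The named queue stays blocked until both rows are cleared.
  UPDATE graphile_worker.jobs
  SET locked_at = NULL, locked_by = NULL
  WHERE locked_by = 'worker-cec1a1fada9a164606';

  UPDATE graphile_worker.job_queues
  SET locked_at = NULL, locked_by = NULL
  WHERE locked_by = 'worker-cec1a1fada9a164606';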

Possible Solution

From Benjie

There’s probably a bug if the system crashes on the max attempt, because I think max attempts is bumped first, so it’s never reattempted, but it’s still locked. Please file 👍 Specifically, the system has to crash, which is why I’ve not seen it. (*Crash or be killed)
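A simplified illustration of that ordering (a sketch only, not graphile-worker's actual get_job/fail_job SQL):

  -- Sketch: checking out a job bumps attempts and takes the lock in one step,
  -- roughly like this.
  UPDATE graphile_worker.jobs
  SET attempts  = attempts + 1,
      locked_at = now(),
      locked_by = 'worker-cec1a1fada9a164606'
  WHERE id = 217329284;

  -- complete_job / fail_job would normally clear locked_at afterwards, but if
  -- the worker process dies first the lock survives. Later get_job calls skip
  -- jobs whose attempts have reached max_attempts, so a job that crashed on
  -- its final attempt is never re-selected and never unlocked.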

gregberns commented Mar 15 '22 22:03

There's a number of things we need to investigate here:

  1. max attempts is 10 even though updated_at is only 30 minutes (almost exactly) after created_at - how are locked jobs being reattempted so quickly?
  2. run_at is the created_at time - it seems that, though attempts was bumped, run_at was not - we should fix this, maybe by bumping run_at on get_job (when we bump attempts) and also on fail_job to accommodate the delay from executing the job.
  3. locked_at is the created_at timestamp, which means that get_job should be refusing to pick it up at all, and yet apparently it is not.
  4. if a job hits its max attempts and the process exits prematurely, then a) it will be locked, and b) it will never be re-selected because its max_attempts has been reached -> it stays locked forever
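Jobs stuck in the state from point 4 should be identifiable with a query along these lines (a minimal sketch, assuming the default graphile_worker schema and the 4-hour stale-lock interval):

  -- Minimal sketch: list jobs that are still locked but can never be
  -- re-selected because attempts has already reached max_attempts.
  SELECT id, queue_name, attempts, max_attempts, locked_by, locked_at
  FROM graphile_worker.jobs
  WHERE locked_at IS NOT NULL
    AND locked_at < now() - interval '4 hours'  -- the lock is long stale
    AND attempts >= max_attempts;               -- ...yet the job will never be retried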

Before we dig any further... Are you using a custom get_jobs? Or are you updating the jobs/job queues to null out the locked_at or locked_by columns or similar?

benjie commented Mar 16 '22 09:03

  1. No idea how max_attempts is 10. I suspect that it's 1 when the job is started, then something else 30 mins later is changing it, but no idea what that would be.

Are you using a custom get_jobs?

Not to my knowledge

Or are you updating the jobs/job queues to null out the locked_at or locked_by columns or similar?

Not to my knowledge

I'm going to write a very generic test harness to try and reproduce without all our other stuff involved.

gregberns commented Mar 16 '22 22:03

I think they were added with max_attempts set to 10

johnhaley81 commented Mar 16 '22 23:03

Sorry I meant attempts not max attempts. It shouldn’t be able to reach 10 for at least 9 hours; 30 mins is way too fast.

benjie commented Mar 16 '22 23:03

We figured out what was happening, and it was unrelated to GraphileWorker. We were creating another pool and didn't have handlers correctly hooked up.

gregberns commented Mar 20 '22 06:03

I'm reopening this because I want to investigate the things above anyway to make sure they're not issues.

benjie commented Mar 22 '22 10:03