worker icon indicating copy to clipboard operation
worker copied to clipboard

Add number of attempts and next retry to logging on failed task

Open nodesocket opened this issue 4 years ago • 2 comments

Feature description

Is it possible when a task fails, to print the number of attempts it is on and next retry date/time? Currently I only see:

[worker(worker-cdfafb0b60a41d39d9)] ERROR: Failed task 2 (email) with error Error (0.49ms):
  Error
    at module.exports (/var/myapp/worker/tasks/email.js:23:15)
    at doNext (/var/myapp/worker/node_modules/graphile-worker/dist/worker.js:184:23)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)

Requesting something like:

[worker(worker-cdfafb0b60a41d39d9)] ERROR: Failed task 2 (email) with error Error (0.49ms):
  Error
    at module.exports (/var/myapp/worker/tasks/email.js:23:15)
    at doNext (/var/myapp/worker/node_modules/graphile-worker/dist/worker.js:184:23)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
Attempt [1 of 25], next retry in 3s.

[worker(worker-cdfafb0b60a41d39d9)] ERROR: Failed task 2 (email) with error Error (0.49ms):
  Error
    at module.exports (/var/myapp/worker/tasks/email.js:23:15)
    at doNext (/var/myapp/worker/node_modules/graphile-worker/dist/worker.js:184:23)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
Attempt [2 of 25], next retry in 6s.

Motivating example

Make it easy to see how many attempts a tasks is on and when the next attempt will occur.

Supporting development

  • [ ] am interested in building this feature myself
  • [x] am interested in collaborating on building this feature
  • [x] am willing to help testing this feature before it's released
  • [ ] am willing to write a test-driven test suite for this feature (before it exists)
  • [ ] am a Graphile sponsor ❤️
  • [ ] have an active support or consultancy contract with Graphile

nodesocket avatar Nov 10 '21 01:11 nodesocket

How many attempts a task is on: easy.

When it's next attempt will occur: more challenging.

Here's the relevant area of code that needs changing:

https://github.com/graphile/worker/blob/5a09a37e94f49de0d531569f6ca00d5dadd816a5/src/worker.ts#L262-L277

The job variable already knows the attempts so adding that to the log message should be simple. The "next attempt" is more challenging because it's not assigned until after the fail_job DB call. Ideally I'd rather not move the logging until after the DB call - the log should occur even if the DB call results in an error because we need traceability. Here's where the new run_at is assigned after failure:

https://github.com/graphile/worker/blob/5a09a37e94f49de0d531569f6ca00d5dadd816a5/tests/schema.sql#L210

I think our best bet is probably to just estimate it. Something like "Will run again in around {4 hours | 10 minutes | 8 seconds}" - find the order of the delay and then fuzz it to the nearest interval.

Why don't you take a stab at it, open a PR, and we can continue discussion there? :+1:

benjie avatar Nov 11 '21 10:11 benjie

@benjie thanks for the comments and help. I opened a very simple starting PR draft at https://github.com/graphile/worker/pull/239. I could use some help calculating the next fuzz delay though.

nodesocket avatar Nov 15 '21 22:11 nodesocket

Fixed in #239

benjie avatar Oct 24 '23 10:10 benjie