Add number of attempts and next retry to logging on failed task
Feature description
Is it possible when a task fails, to print the number of attempts it is on and next retry date/time? Currently I only see:
[worker(worker-cdfafb0b60a41d39d9)] ERROR: Failed task 2 (email) with error Error (0.49ms):
Error
at module.exports (/var/myapp/worker/tasks/email.js:23:15)
at doNext (/var/myapp/worker/node_modules/graphile-worker/dist/worker.js:184:23)
at processTicksAndRejections (internal/process/task_queues.js:95:5)
Requesting something like:
[worker(worker-cdfafb0b60a41d39d9)] ERROR: Failed task 2 (email) with error Error (0.49ms):
Error
at module.exports (/var/myapp/worker/tasks/email.js:23:15)
at doNext (/var/myapp/worker/node_modules/graphile-worker/dist/worker.js:184:23)
at processTicksAndRejections (internal/process/task_queues.js:95:5)
Attempt [1 of 25], next retry in 3s.
[worker(worker-cdfafb0b60a41d39d9)] ERROR: Failed task 2 (email) with error Error (0.49ms):
Error
at module.exports (/var/myapp/worker/tasks/email.js:23:15)
at doNext (/var/myapp/worker/node_modules/graphile-worker/dist/worker.js:184:23)
at processTicksAndRejections (internal/process/task_queues.js:95:5)
Attempt [2 of 25], next retry in 6s.
Motivating example
Make it easy to see how many attempts a tasks is on and when the next attempt will occur.
Supporting development
- [ ] am interested in building this feature myself
- [x] am interested in collaborating on building this feature
- [x] am willing to help testing this feature before it's released
- [ ] am willing to write a test-driven test suite for this feature (before it exists)
- [ ] am a Graphile sponsor ❤️
- [ ] have an active support or consultancy contract with Graphile
How many attempts a task is on: easy.
When it's next attempt will occur: more challenging.
Here's the relevant area of code that needs changing:
https://github.com/graphile/worker/blob/5a09a37e94f49de0d531569f6ca00d5dadd816a5/src/worker.ts#L262-L277
The job variable already knows the attempts so adding that to the log message should be simple. The "next attempt" is more challenging because it's not assigned until after the fail_job DB call. Ideally I'd rather not move the logging until after the DB call - the log should occur even if the DB call results in an error because we need traceability. Here's where the new run_at is assigned after failure:
https://github.com/graphile/worker/blob/5a09a37e94f49de0d531569f6ca00d5dadd816a5/tests/schema.sql#L210
I think our best bet is probably to just estimate it. Something like "Will run again in around {4 hours | 10 minutes | 8 seconds}" - find the order of the delay and then fuzz it to the nearest interval.
Why don't you take a stab at it, open a PR, and we can continue discussion there? :+1:
@benjie thanks for the comments and help. I opened a very simple starting PR draft at https://github.com/graphile/worker/pull/239. I could use some help calculating the next fuzz delay though.
Fixed in #239