Custom job locked interval
Summary
Currently, if a job fails unexpectedly, the worker blocks out the job for 4 hours (per the documentation section "What if something goes wrong?"). Is there a possibility that we can set it to some custom value - say 1 hour?
Additional context
Some critical tasks (short duration) run using graphile and would prefer shorter locked intervals. As a bonus, is there a way to set locked_at per Runner, so that critical tasks can be locked for a short duration and long-running ones for longer durations?
Thanks for building this amazing library based on PostgreSQL's SKIP LOCKED. 🙌
This is in fact an option to get_job in the database schema (job_expiry):
https://github.com/graphile/worker/blob/5eb792b1f31ff0b41e3591b34e25e87b73ce9c4c/tests/schema.sql#L223
but alas it is not yet configurable via the JavaScript API or CLI.
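For reference, calling the function directly in SQL with a custom expiry might look like the sketch below (the `job_expiry` parameter name comes from the linked schema; the exact signature and the dummy worker id are assumptions and may differ between versions). Note this bypasses the JS API, which is exactly the danger discussed later: every worker process must agree on the same value.

```sql
-- Hypothetical direct call to get_job with a shorter expiry.
-- All worker processes would need to pass the same job_expiry value.
SELECT *
FROM graphile_worker.get_job(
  'worker-abc123',                 -- worker_id (example value)
  null,                            -- task identifiers (null = any task)
  job_expiry := interval '1 hour'  -- instead of the default 4 hours
);
```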
Specifically this is where it would need to be added to the JS:
https://github.com/graphile/worker/blob/5eb792b1f31ff0b41e3591b34e25e87b73ce9c4c/src/worker.ts#L131-L144
One reason I've not exposed this yet is because it's critical that this number is the same for all processes, and I wasn't happy with the risk of the figure coming out of sync and causing weird issues. A PR to add this feature would have to come with sufficient warnings in place to help ensure people don't shoot themselves in the foot accidentally - maybe in the form of documentation, maybe in the form of programmatic warnings, I'm honestly not sure.
Thanks for the quick reply @benjie really appreciate it! 🙌
Continuing on the reply...
Does this mean that even the "4 hour" interval is just an arbitrary value that you had picked? There is no real reason for the interval to be 4 hours rather than 2 hours or even 30 minutes, right?
The only consideration is that the job should finish before the interval, otherwise issues start popping up?
Pretty much, yeah. It's designed to allow restarting crashed processes that were unable to report success/failure. It's meant to be very much a fallback that should virtually never be exercised (except when servers die without shutting down cleanly).
Hmmmm... To address my original question then, what do you think about the following:
- Modify the `jobs` table to have an additional column `locked_interval`, which would have a default value of 4 hours (matching the current implementation) but could be overridden with any other time interval value too (point 3).
- Modify `get_job` to pick up jobs based on the row's `locked_at + locked_interval` instead of `locked_at + interval '4 hours'`.
- Update the `add_job` API with an additional option `lockedInterval`, which would fill the `locked_interval` column in the database. Through this value we can control the lock period at a per-job level.
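The three points above could be sketched as a migration like the following. This is purely illustrative: the column name, schema name, and the `locked_interval :=` parameter are the proposal's hypothetical names, not anything that exists in graphile-worker today.

```sql
-- Sketch of the proposed change (all names hypothetical).

-- 1. Per-job lock interval, defaulting to the current 4-hour behaviour:
ALTER TABLE graphile_worker.jobs
  ADD COLUMN locked_interval interval NOT NULL DEFAULT interval '4 hours';

-- 2. Inside get_job, the "is this lock expired?" check would change from
--      locked_at < now() - interval '4 hours'
--    to
--      locked_at < now() - locked_interval

-- 3. add_job would gain a matching option to populate the column, e.g.
--      SELECT graphile_worker.add_job('critical_task', '{}',
--        locked_interval := interval '30 minutes');
```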
Do you believe these modifications are feasible? And would they impact performance significantly enough that they can't be implemented?
I suppose we could create an index on `locked_at + locked_interval`. I'd rather have a `locked_until`, but I guess that wouldn't work so well on retries... This is a good way of solving the "everyone must share the same value" problem, but you're the first person who's asked for this, so I'm not 100% sure it's a need that's sufficiently universal to justify a database schema modification. That said, I'm not sure there would be a (measurable) performance penalty; it's just a tuple that's slightly larger and an index over a trivial arithmetic expression...
If you're interested in building this, I say give it a go (don't bother writing tests for it yet) and we can check out the performance impact with the perfTest script.
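The expression index mentioned above might look like this (index name and schema are assumptions; it presumes the proposed `locked_interval` column exists):

```sql
-- Hypothetical expression index so expired-lock scans stay cheap:
CREATE INDEX jobs_lock_expiry_idx
  ON graphile_worker.jobs ((locked_at + locked_interval));
```

For the planner to use it, `get_job` would need to filter on the same expression, e.g. `WHERE locked_at + locked_interval < now()`.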
I'm going to close this in favour of #222