
Custom job locked interval

Ashniu123 opened this issue 4 years ago · 5 comments

Summary

Currently, if a job fails unexpectedly, it is locked out for 4 hours (as per the documentation section "What if something goes wrong?"). Is there a possibility that we can set this to some custom value, say 1 hour?

Additional context

Some critical, short-duration tasks run using graphile and would benefit from shorter lock intervals. As a bonus, is there a way to set locked_at per Runner, so that critical jobs can be locked for a short duration and long-running ones for longer?


Thanks for building this amazing library based on SKIP LOCKED of pg. 🙌

Ashniu123 avatar Jun 11 '21 09:06 Ashniu123

This is in fact an option to get_job in the database schema (job_expiry):

https://github.com/graphile/worker/blob/5eb792b1f31ff0b41e3591b34e25e87b73ce9c4c/tests/schema.sql#L223

but alas it is not yet configurable via the JavaScript API or CLI.

Specifically this is where it would need to be added to the JS:

https://github.com/graphile/worker/blob/5eb792b1f31ff0b41e3591b34e25e87b73ce9c4c/src/worker.ts#L131-L144

One reason I've not exposed this yet is because it's critical that this number is the same for all processes, and I wasn't happy with the risk of the figure coming out of sync and causing weird issues. A PR to add this feature would have to come with sufficient warnings in place to help ensure people don't shoot themselves in the foot accidentally - maybe in the form of documentation, maybe in the form of programmatic warnings, I'm honestly not sure.
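To make the mechanism concrete, here is a minimal sketch (not graphile-worker's actual SQL, which lives in get_job) of the stale-lock check being discussed: a job whose locked_at timestamp is older than the expiry window (the hard-coded 4 hours, or a custom job_expiry) is treated as unlocked again. The function name and millisecond-based API are illustrative assumptions.

```javascript
// Illustrative sketch only: mirrors the check get_job performs in SQL.
// A lock is "stale" once more than expiryMs has passed since locked_at.
const DEFAULT_JOB_EXPIRY_MS = 4 * 60 * 60 * 1000; // the hard-coded "4 hours"

function lockIsStale(lockedAt, now = Date.now(), expiryMs = DEFAULT_JOB_EXPIRY_MS) {
  if (lockedAt == null) return false; // never locked, so nothing to expire
  return now - lockedAt > expiryMs;
}

// A worker that crashed 5 hours ago: its lock is now stale and the job
// can be picked up again.
console.log(lockIsStale(Date.now() - 5 * 60 * 60 * 1000)); // → true

// With a custom 1-hour expiry, a job locked 90 minutes ago is already stale:
console.log(lockIsStale(Date.now() - 90 * 60 * 1000, Date.now(), 60 * 60 * 1000)); // → true
```

This also shows why the value must agree across all processes: if one worker considered a lock stale after 1 hour while another held it for 4, the first could grab a job the second was still running.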

benjie avatar Jun 11 '21 10:06 benjie

Thanks for the quick reply @benjie, really appreciate it! 🙌

Continuing on the reply...

Does this mean that even the "4 hour" interval is just an arbitrary value you chose? There is no real reason for the interval to be 4 hours, 2 hours, or even 30 minutes, right? The only consideration is that the job should finish before the interval elapses, otherwise issues start popping up?

Ashniu123 avatar Jun 11 '21 11:06 Ashniu123

Pretty much, yeah. It's designed to allow restarting crashed processes that were unable to report success/failure. It's meant to be very much a fallback that should virtually never be exercised (except when servers die without shutting down cleanly).

benjie avatar Jun 11 '21 12:06 benjie

Hmmmm... To address my original question then, what do you think about the following:

  1. Modify the jobs table to have an additional column locked_interval, which would default to 4 hours (as in your implementation) but could be overridden with any other time interval value (point 3).
  2. Modify get_job to pick up a job based on the row's locked_at + locked_interval instead of locked_at + 4 hours.
  3. Update the add_job API with an additional option, lockedInterval, which would fill the locked_interval column in the database. Through this value we can control the lock period at a per-job level.

Do you believe these modifications are plausible? And would it impact the performance significantly enough that it can't be implemented?

Ashniu123 avatar Jun 12 '21 15:06 Ashniu123

I suppose we could create an index on locked_at + locked_interval. I'd rather have a locked_until column, but I guess that wouldn't work so well on retries... This is a good way of solving the "everyone must share the same value" problem, but you're the first person to ask for this, so I'm not 100% sure the need is sufficiently universal to justify modifying the database schema. That said, I'm not sure there would be a measurable performance penalty; it's just a tuple that's 4 bytes larger and an index over a trivial arithmetic expression...

If you're interested in building this, I say give it a go (don't bother writing tests for it yet) and we can check out the performance impact with the perfTest script.

benjie avatar Jun 14 '21 09:06 benjie

I'm going to close this in favour of #222

benjie avatar Oct 24 '23 10:10 benjie