
Jobs that time out will never be able to run again

Open lwc opened this issue 10 years ago • 6 comments

When a job overruns its TTR, beanstalkd will increment the job's timeout stat and put it back on the work queue for another worker to reserve.

In an effort to prevent pathological jobs from dog-piling all available workers, cmdstalk will bury any task it reserves whose timeouts count is greater than 1. This means that once a task is buried because of a timeout, it will always be re-buried instantly each time it is kicked: the job becomes un-runnable.
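A minimal sketch of why kicking does not help, assuming the check looks only at the job's timeouts counter. The `jobStats` struct and `shouldBury` function are hypothetical names for illustration, not cmdstalk's actual code; the fields mirror the `timeouts`, `buries` and `kicks` counters that beanstalkd reports in `stats-job`.

```go
package main

import "fmt"

// jobStats holds the subset of beanstalkd stats-job counters discussed here.
// Hypothetical type for illustration only.
type jobStats struct {
	Timeouts int
	Buries   int
	Kicks    int
}

// shouldBury is a hypothetical stand-in for the rule described above:
// bury any reserved job whose timeouts counter is greater than 1.
func shouldBury(s jobStats) bool {
	return s.Timeouts > 1
}

// reserveAfterKicks simulates kicking and re-reserving a job n times and
// returns how many of those reserves ended in an immediate bury.
// The timeouts counter never decreases, so a kick cannot change the decision.
func reserveAfterKicks(s jobStats, n int) int {
	reburied := 0
	for i := 0; i < n; i++ {
		s.Kicks++ // an operator kicks the buried job
		if shouldBury(s) {
			s.Buries++ // the worker buries it again without running it
			reburied++
		}
	}
	return reburied
}

func main() {
	s := jobStats{Timeouts: 2, Buries: 1}
	fmt.Println("re-buried on", reserveAfterKicks(s, 3), "of 3 kicks")
}
```

Because the decision depends on a counter that only ever grows, no number of kicks lets the job run again.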

Using just the buried, kicked and timeout counters, there does not appear to be a way to tell "kicks that follow buries due to timeouts" apart from other kicks in a way that would allow cmdstalk to bury a job only the next time it is reserved after a timeout.

The beanstalkd protocol docs mention a one-second grace period at the end of a reserve time. Would it be possible to use this grace period to bury a timed-out job in the "same run" in which the timeout occurred?

lwc avatar Aug 27 '14 05:08 lwc

Upon further reading I'm less clear on how DEADLINE_SOON is meant to operate :confused:

lwc avatar Aug 27 '14 05:08 lwc

DEADLINE_SOON is sent to a client that is in a blocking reserve if there are no other jobs to send it and a job it has is nearing TTR deadline.

The issue with racing beanstalkd to bury a task is that you miss out on the timed-out metadata. It's simply buried, if you beat the server to it.

Perhaps it's just the fact that the job is buried on timeout? What should actually happen to timed-out jobs? If we just kick them at the minute then perhaps we should change the behaviour to release with delay to prevent dog-piling. Perhaps timeouts could result in a more aggressive exponential backoff, or a more premature bury.

Either way, seems like we haven't got it 100% right. Thoughts @pda @rbone?

lox avatar Sep 11 '14 05:09 lox

A longer backoff sounds like a reasonable change for the moment. It is tricky, however, as some tasks may merit more aggressive burying strategies while others may be safe to retry very frequently. I'd say a longer backoff makes sense as a default, but it might be nice in the future to make this behaviour configurable, possibly even on a per-tube basis.

rbone avatar Sep 11 '14 05:09 rbone

Should the backoff be proportional to the TTR?

lox avatar Sep 11 '14 05:09 lox

Honestly I can't make up my mind on what the default behaviour should be, so it probably doesn't matter too much which way you go. A backoff proportional to the TTR should be fine. I think having it be configurable per tube will become pretty important, however.

rbone avatar Sep 14 '14 23:09 rbone

I think a simple function of the try count c should work fine for now.

PR #4 proposes 3 tries with c*c * time.Hour; delays are 0 (first try), 1 hour, 4 hours; total of 5 hours. 4 tries at c*c * time.Hour could also work; that would add an extra retry after an additional 9 hours.

pda avatar Sep 15 '14 17:09 pda