
Unique cron executes multiple times when running multiple workers

Open Tinche opened this issue 5 years ago • 6 comments

Hi,

We're seeing this in our services. I believe the problem can be summarized as follows.

Poll iterations are done every 500 ms. This is a long time, relatively speaking.

Worker 0, at t0:

  0. Poll the queue, enqueue the cron.

Worker 1, at t0 + 10ms:

  1. Poll the queue, get the cron job ID
  2. Check in_progress_key, queue score. Start the job
  3. Job finishes really quickly (~1 ms)
  4. The cleanup atomically deletes the in_progress_key, retry key, job key, queue score

Worker 2, at t0 + 20ms:

  5. Poll the queue, schedule the cron

Worker 3, at t0 + 30ms:

  6. Poll the queue, get the cron job ID
  7. Check in_progress_key, queue score. Everything is in order since the last execution cleaned up after itself. Start the job

I'm willing to contribute a fix for this, but let's discuss it first.

We should probably have a way of storing the fact that a cron job finished executing successfully. (I'm not sure how retries are handled for unique crons, so I'm ignoring that case right now.) The simplest way of doing this would be to not delete the in_progress key, but set it to expire in a number of seconds. (The number should be greater than the poll duration, and should probably be greater than average clock drift if the servers don't have NTP configured. Let's say 60 seconds.)
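As a rough sketch of what I mean (key names are illustrative and this isn't arq's actual cleanup code; I'm using `redis.asyncio` just to show the idea):

```python
import redis.asyncio as redis

KEEP_COMPLETION_SECONDS = 60  # > poll delay, with some margin for clock drift

async def cleanup_unique_cron(r: redis.Redis, job_id: str) -> None:
    async with r.pipeline(transaction=True) as pipe:
        # delete the retry key, job key and queue score as before...
        pipe.delete(f"arq:retry:{job_id}", f"arq:job:{job_id}")
        pipe.zrem("arq:queue", job_id)
        # ...but instead of deleting the in_progress key, let it expire later,
        # so other workers still see this run as "recently completed"
        pipe.expire(f"arq:in-progress:{job_id}", KEEP_COMPLETION_SECONDS)
        await pipe.execute()
```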

I think this change alone would fix this issue. To be a little more efficient, we could also check for the existence of this key while enqueueing the cron, so that not every worker process enqueues it separately, saving a little work. Since this is only an optimization, it doesn't have to be perfectly free of race conditions.
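The optimization part would then look roughly like this (again just a sketch; `enqueue_cron` stands in for whatever the existing enqueue path is):

```python
async def maybe_enqueue_cron(r: redis.Redis, job_id: str) -> None:
    # if the previous run's in_progress key is still alive (i.e. the cron
    # completed within the last KEEP_COMPLETION_SECONDS), skip the enqueue
    if await r.exists(f"arq:in-progress:{job_id}"):
        return
    await enqueue_cron(r, job_id)  # hypothetical existing enqueue helper
```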

Tinche avatar Jul 01 '20 23:07 Tinche

Maybe there should be master/slave elections between the workers. Only the master would be responsible for enqueuing (and also consuming). The Raft algorithm may be useful here. https://raft.github.io/

aleksarias avatar Aug 21 '20 14:08 aleksarias

sorry for the slow response on this.

I don't really want to build raft into arq if I can help it, especially since (off the top of my head) I think it's only required for enqueuing cron jobs.

An atomic way to make sure the job is only enqueued once seems like the best route to me, especially as we have similar logic for making sure jobs are only executed once - we might even be able to extract the logic into a function and reuse it.
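Off the top of my head, something along these lines (just a sketch, not actual arq code), where a single `SET NX` decides which worker gets to enqueue:

```python
async def try_enqueue_once(r: redis.Redis, job_id: str, payload: bytes,
                           score: float) -> bool:
    # NX: only the first worker to get here creates the job key;
    # everyone else gets back None and does nothing
    created = await r.set(f"arq:job:{job_id}", payload, nx=True, ex=3600)
    if created:
        await r.zadd("arq:queue", {job_id: score})
    return bool(created)
```

The same pattern (or a shared helper) could then back the execute-once check as well.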

Happy to review a PR, though (as you might have guessed from my woefully tardy response here) I'm pretty busy right now.

samuelcolvin avatar Aug 21 '20 16:08 samuelcolvin

Yeah I agree this can be solved in a much simpler way than bringing in Raft. I have a proof of concept ready to test in our codebase, then I can submit a PR.

My current approach is to enqueue the job before the cron deadline. That way, if you're running N workers, they will all enqueue, but every enqueue after the first will be idempotent. This runs into a corner case with poll_delay being 0, though (I don't know if people run Arq with poll_delay 0; it seems like a good way of killing your CPU?).

Taking this logic further, the workers don't even need to enqueue the cron; they can just try executing it when the deadline comes. The usual Arq machinery will take care of letting only one worker proceed. This basically moves some load from the queue to the in-progress key.
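In rough pseudo-code, each worker would do something like this (the key name and TTL are made up; it's the same claim-the-key trick, just applied to execution rather than enqueueing):

```python
import asyncio
import time

async def run_cron_when_due(r: redis.Redis, cron_name: str, deadline: float,
                            run_job) -> None:
    # sleep until the cron deadline, then race the other workers for the key
    await asyncio.sleep(max(0.0, deadline - time.time()))
    claimed = await r.set(f"arq:in-progress:cron:{cron_name}:{int(deadline)}",
                          b"1", nx=True, ex=60)
    if claimed:
        # only the worker that won the SET NX actually runs the job
        await run_job()
```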

I would still like to implement a check that artificially extends the life of the in-progress key for unique crons after the cron ends, to account for workers that might have skewed clocks or are blocked on something for whatever reason.

Tinche avatar Aug 21 '20 17:08 Tinche

I haven't looked closely at how arq treats cron jobs in a while, but how about:

Cron jobs are just normal jobs, but with a job ID that includes the cron function name and the time; thus existing logic should prevent the job being enqueued twice, and existing logic would also prevent it being executed twice.

All workers can try to create the next cron job (for each function) when they start; the worker that picks up a cron job can take care of enqueuing the next one before it runs the current job.
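Very roughly (a sketch using the public enqueue API; `next_run_time()` is a made-up helper):

```python
from datetime import datetime

async def enqueue_next_run(redis, cron_name: str) -> None:
    when: datetime = next_run_time(cron_name)  # hypothetical helper
    job_id = f"cron:{cron_name}:{int(when.timestamp())}"
    # enqueue_job returns None (no-op) if a job with this ID already exists,
    # so every worker can call this without creating duplicates
    await redis.enqueue_job(cron_name, _job_id=job_id, _defer_until=when)
```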

Does that make sense?

samuelcolvin avatar Aug 21 '20 18:08 samuelcolvin

> Cron jobs are just normal jobs, but with a job ID that includes the cron function name and the time

That's how it works right now for unique cron jobs, yep.

> existing logic should prevent the job being enqueued twice, and existing logic would also prevent it being executed twice

This is not the case right now, as described in the original post.

> All workers can try to create the next cron job (for each function) when they start; the worker that picks up a cron job can take care of enqueuing the next one before it runs the current job.

That's an interesting approach that could work. A problem I see with this is that it's fragile to interruptions in a critical section - if a worker dies or loses connectivity to Redis at exactly the wrong point, it won't be able to enqueue the next iteration, so the cron job will not run again until a new worker starts.

Tinche avatar Aug 21 '20 23:08 Tinche

Hey, also ran into this issue when running a unique cron job with 4 arq workers (the job runs twice in my case). It seems like there is a pull request ready to fix this issue; any plans on merging it?

havardthom avatar Feb 12 '21 14:02 havardthom