arq
Status sporadically returns "not found" for a job that just ended
While using ARQ as our job management system, we encountered a case where polling a job's status returned "not found" even though the job existed. After investigating, we found a race between the job's status function and the worker's finish_job function. finish_job updates the job's status in Redis from in progress to done in a single transaction. The status function, however, makes two separate checks: it first checks whether the job is done in Redis and, if it is not, then checks whether the job is in progress. The problem occurs when the finish_job transaction executes between the first check and the second: the first check runs before the job is marked done, the second runs after it is no longer in progress, and so the status function reports "not found". This race can only happen once per job, so our current workaround is to recheck the status after receiving a "not found" answer; if the answer is still "not found", it was not caused by this race. We would appreciate a more stable solution inside the library. Thank you!
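A minimal sketch of the recheck workaround described above, not taken from our code or from the library. The helper name `status_with_recheck` and its parameters are hypothetical; it accepts any async status callable so that it does not depend on ARQ internals.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def status_with_recheck(
    get_status: Callable[[], Awaitable[Any]],
    not_found: Any,
    delay: float = 0.0,
) -> Any:
    """Return the job status, asking once more if the first answer is `not_found`.

    Hypothetical helper: `get_status` is any zero-argument coroutine function
    that returns a status value, and `not_found` is the value that signals a
    missing job.
    """
    status = await get_status()
    if status == not_found:
        # The race can occur only once per job, so a single recheck is enough
        # to tell it apart from a job that really does not exist.
        if delay > 0:
            await asyncio.sleep(delay)
        status = await get_status()
    return status
```

With ARQ this would be called as `await status_with_recheck(job.status, JobStatus.not_found)`, where `job` is an `arq.jobs.Job`. The cost is one extra round of status checks for jobs that are genuinely missing.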
Thanks for reporting.
I don't have time to dig into this right now, but I'm happy to review a PR if someone else has time.