queue
queue copied to clipboard
[10.0][IMP] queue_job: requeue zombie jobs after hard shutdown
Hi @guewen, some modules you are maintaining are being modified, check this out!
This is unfortunately not safe, because the job runner may be running on a different Odoo instance (possibly on another machine) than the jobs themselves. So it could be that the runner has crashed but the jobs are actually still running elsewhere.
@sbidoul Wouldn't it be sufficient to skip locked rows, if the runner is actually working on these? For now, if the process crash late into the night, you might sometime in the morning get a call from end customer that things are not working, investigate a the poorly expressed report before finally fixing it. So now you are well into the work day, your products are not exported, you have 50K+ jobs to run, and you have to deal with transaction rollback in the following hours. There needs to be a solution for this.
Wouldn't it be sufficient to skip locked rows, if the runner is actually working on these?
That is an interesting idea. Currently the RunJobController does not keep a lock on job records while executing the job, but we could experiment reacquiring the lock around here.
isn't this tackled by this cron https://github.com/OCA/queue/blob/e0c5096467e68714b1aa2459aa8ebf22418906c2/queue_job/models/queue_job.py#L279 ?
@simahawk I would assume if there's a need to have requeue_stuck_jobs in production, that means something is horribly wrong that should be fixed instead. The issue Stéphane mentions does not seem addressed by this method. Last, it seems to me that to use it you'd have to compute an upper-bound on the time a job is supposed to take, so in case there's an issue slowing down the process this is going to make things much worse.
@len-foss I wouldnt' say horrible :) it's a hard problem for which no-one contributed a complete solution so far.
@simahawk I did not know that cron. For detecting stuck enqueued
jobs it is reasonable (although I have seen heavy load situations were 5 seconds is too short for enqueued jobs to start). It requires manual tuning to detect jobs that crashed and were left in started
state (and that part is disabled by default, as expected).
So yeah @len-foss's idea looks interesting to me: keep a lock on started jobs, and have something in the job runner loop that resets started jobs on which there is no lock. This new lock might have unexpected effects and backward compatibility implications, not sure, to be tested.
@sbidoul The situation you suggested is actually more complex than the one we have, so we won't really be able to test this patch.
@len-foss fair enough. Would you create an issue with your idea from https://github.com/OCA/queue/pull/423#issuecomment-1098793058 and close this PR ?
@sbidoul Sorry I was not clear, I already pushed that on this PR, but I don't like the idea of giving code that I can't really test.
@sbidoul Sorry I was not clear, I already pushed that on this PR, but I don't like the idea of giving code that I can't really test.
Ah I had not seen your last commit. It looks good to me. So let's keep this PR around until someone has a chance to battle test it.
@len-foss @sbidoul we tested this PR and we got a lot of blocked queries on queue_job
when an exception occurs like:
- a normal exception happening during the execution of the job
- an usual "concurrent update" issue (that Odoo retries)
Each time we had to kill these blocked queries and requeue the jobs. Without this PR we don't face anymore any issue.
It could be reproduced again locally with jobs raising an exception I guess, we didn't take time for that.
@sebalix you actually tested before we had a chance to do it :)
I believe the behavior you observe is because in case of exception we try to write on the job in a different transaction and that creates a deadlock because the main transaction still holds a lock on the job record.
So yeah, this PR is not going to work as is. Resetting to draft.
There hasn't been any activity on this pull request in the past 4 months, so it has been marked as stale and it will be closed automatically if no further activity occurs in the next 30 days. If you want this PR to never become stale, please ask a PSC member to apply the "no stale" label.