queue [10.0][IMP] queue_job: requeue zombie jobs after hard shutdown

Apr 11 '22 12:04 len-foss

Hi @guewen, some modules you are maintaining are being modified, check this out!

Apr 11 '22 12:04 OCA-git-bot

This is unfortunately not safe, because the job runner may be running on a different Odoo instance (possibly on another machine) than the jobs themselves. So it could be that the runner has crashed but the jobs are actually still running elsewhere.

Apr 13 '22 20:04 sbidoul

@sbidoul Wouldn't it be sufficient to skip locked rows, if the runner is actually working on these? For now, if the process crash late into the night, you might sometime in the morning get a call from end customer that things are not working, investigate a the poorly expressed report before finally fixing it. So now you are well into the work day, your products are not exported, you have 50K+ jobs to run, and you have to deal with transaction rollback in the following hours. There needs to be a solution for this.

Apr 14 '22 07:04 len-foss

Wouldn't it be sufficient to skip locked rows, if the runner is actually working on these?

That is an interesting idea. Currently the RunJobController does not keep a lock on job records while executing the job, but we could experiment reacquiring the lock around here.

Apr 14 '22 08:04 sbidoul

isn't this tackled by this cron https://github.com/OCA/queue/blob/e0c5096467e68714b1aa2459aa8ebf22418906c2/queue_job/models/queue_job.py#L279 ?

Apr 14 '22 08:04 simahawk

@simahawk I would assume if there's a need to have requeue_stuck_jobs in production, that means something is horribly wrong that should be fixed instead. The issue Stéphane mentions does not seem addressed by this method. Last, it seems to me that to use it you'd have to compute an upper-bound on the time a job is supposed to take, so in case there's an issue slowing down the process this is going to make things much worse.

Apr 14 '22 08:04 len-foss

@len-foss I wouldnt' say horrible :) it's a hard problem for which no-one contributed a complete solution so far.

@simahawk I did not know that cron. For detecting stuck enqueued jobs it is reasonable (although I have seen heavy load situations were 5 seconds is too short for enqueued jobs to start). It requires manual tuning to detect jobs that crashed and were left in started state (and that part is disabled by default, as expected).

So yeah @len-foss's idea looks interesting to me: keep a lock on started jobs, and have something in the job runner loop that resets started jobs on which there is no lock. This new lock might have unexpected effects and backward compatibility implications, not sure, to be tested.

Apr 14 '22 08:04 sbidoul

@sbidoul The situation you suggested is actually more complex than the one we have, so we won't really be able to test this patch.

Apr 19 '22 17:04 len-foss

@len-foss fair enough. Would you create an issue with your idea from https://github.com/OCA/queue/pull/423#issuecomment-1098793058 and close this PR ?

Apr 19 '22 17:04 sbidoul

@sbidoul Sorry I was not clear, I already pushed that on this PR, but I don't like the idea of giving code that I can't really test.

Apr 21 '22 07:04 len-foss

@sbidoul Sorry I was not clear, I already pushed that on this PR, but I don't like the idea of giving code that I can't really test.

Ah I had not seen your last commit. It looks good to me. So let's keep this PR around until someone has a chance to battle test it.

Apr 21 '22 07:04 sbidoul

@len-foss @sbidoul we tested this PR and we got a lot of blocked queries on queue_job when an exception occurs like:

a normal exception happening during the execution of the job
an usual "concurrent update" issue (that Odoo retries)

Each time we had to kill these blocked queries and requeue the jobs. Without this PR we don't face anymore any issue.

It could be reproduced again locally with jobs raising an exception I guess, we didn't take time for that.

Jul 08 '22 06:07 sebalix

@sebalix you actually tested before we had a chance to do it :)

I believe the behavior you observe is because in case of exception we try to write on the job in a different transaction and that creates a deadlock because the main transaction still holds a lock on the job record.

So yeah, this PR is not going to work as is. Resetting to draft.

Jul 13 '22 12:07 sbidoul

There hasn't been any activity on this pull request in the past 4 months, so it has been marked as stale and it will be closed automatically if no further activity occurs in the next 30 days. If you want this PR to never become stale, please ask a PSC member to apply the "no stale" label.

Nov 13 '22 12:11 github-actions[bot]