[18.0] Job batches stuck in In Progress
Module
queue_job_batch
Describe the bug
Job batches get stuck in "In Progress" even when all of their jobs are finished.
To Reproduce
Odoo 18.0 Enterprise
Steps to reproduce the behavior:
- Create a few job batches, each with more than one job.
- Wait a few hours; at random, some batches will be stuck in "In Progress".
Expected behavior
The batch should be marked as finished when all of its jobs are finished.
Additional context
Can you provide additional details, such as logs? I have encountered something like this in the past when my jobs took longer than the --limit-time-* settings allowed.
Hi @amh-mw, thank you for your response. I will increase the limit-time-* values and see if this helps.
I looked at the jobs in another batch, and all of them finished in the "done" state while taking less than the default limit-time-cpu value of 60 seconds. Additionally, the total time for all jobs was less than 60 seconds. So I do not think it is because of a timeout.
I do not see any warning or error logs.
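For what it's worth, this is roughly how I checked the runtimes from an Odoo shell (a quick sketch; it assumes queue.job exposes date_started, date_done, channel and uuid, which may vary between queue_job versions):

# Quick runtime check for recent jobs on the processing channel, run from
# `odoo shell`. Field names (date_started, date_done, channel, uuid) are
# what I expect queue.job to expose; adjust if your version differs.
jobs = env["queue.job"].search(
    [("channel", "=", "root.process"), ("state", "=", "done")],
    order="date_done desc",
    limit=50,
)
for job in jobs:
    if job.date_started and job.date_done:
        duration = (job.date_done - job.date_started).total_seconds()
        print(job.uuid, f"{duration:.1f}s")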
Here is my job pseudocode:
import logging

import requests

from odoo.addons.queue_job.exception import RetryableJobError

_logger = logging.getLogger(__name__)


def process(self, param):
    _logger.info(f"Processing {len(self)} items with parameter {param}")
    # Endpoint and credentials come from system parameters
    url = self.env['ir.config_parameter'].sudo().get_param('service.url')
    api_key = self.env['ir.config_parameter'].sudo().get_param('service.key')
    for rec in self:
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        }
        input_text = rec.name
        data = {"input": input_text}
        try:
            # One HTTP call per record; failures are raised as retryable job errors
            response = requests.post(url, headers=headers, json=data)
            response.raise_for_status()
            _logger.info(f"Response: {response.json()}")
            output = response.json().get("result", {})
            value1 = output.get("key1", rec.value1)
            value2 = output.get("key2", rec.value2)
            data_dict = {
                'key1': value1,
                'key2': value2,
            }
            rec.update_values('field', data_dict)
        except Exception as e:
            _logger.error(f"Failed for input '{input_text}': {e}")
            raise RetryableJobError(f"Failed for input '{input_text}': {e}") from e
And here is the batch method:
def start(self):
    # One batch groups all the delayed jobs; each subset becomes one job
    batch = self.env['queue.job.batch'].sudo().get_new_batch('Batch Process')
    batch_size = 1  # Adjust as needed
    for i in range(0, len(self), batch_size):
        subset = self[i:i + batch_size]
        subset.sudo().with_context(job_batch=batch).with_delay(
            channel='root.process',
            description='Process subset',
        ).process('param_value')
    return True
PS: I also tried adjusting batch_size, from 10 down to 1.
I found the root cause of this issue: a race condition in the batch state recompute when multiple jobs finish at the same time (with at least two channels in use). This leaves some batches stuck in "In Progress" even though all of their jobs are "Done".
To validate this hypothesis, I will implement some fixes in my codebase. If they hold up after testing in my deployment environment, I will open a pull request with the changes.
https://github.com/OCA/queue/blob/2e0ed2cc9022eecc517315308ce25c87e0c214ec/queue_job_batch/models/queue_job.py#L22-L32
Another example:
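To make the direction of the fix concrete, here is a rough sketch of what I plan to try (not the module's current code and not a tested patch): serialize the batch state recompute by locking the batch row, then recount the sibling jobs directly in SQL so a concurrent worker cannot act on a stale view. The table and column names (queue_job_batch, queue_job.batch_id), the "finished" state value and the method name are assumptions from my reading of the module, and whether a row lock alone is sufficient also depends on the transaction isolation level the job workers run under.

# Sketch only: a method meant to live on queue.job.batch that serializes the
# recompute. Table/column names and the "finished" value are assumptions,
# not a verified patch against queue_job_batch.
def _mark_finished_if_all_done(self):
    for batch in self:
        # Row-level lock: a worker finishing another job of the same batch
        # blocks here until this transaction commits.
        self.env.cr.execute(
            "SELECT id FROM queue_job_batch WHERE id = %s FOR UPDATE",
            (batch.id,),
        )
        # Recount sibling jobs in SQL instead of trusting cached records.
        self.env.cr.execute(
            "SELECT count(*) FROM queue_job "
            "WHERE batch_id = %s AND state != 'done'",
            (batch.id,),
        )
        (remaining,) = self.env.cr.fetchone()
        if not remaining and batch.state == "progress":
            batch.state = "finished"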
Hi @stamtos, any news on that one? We are facing the same problem here; if your code solves the issue, we would be interested to know / get a PR for that!
Thanks!
@remi-filament, due to other work priorities, I haven't tested anything yet. As a workaround, we have a cron running check_state() periodically:
env['queue.job.batch'].search([('state', '=', 'progress')]).check_state()
env.cr.commit()
Hope it helps in the meantime.
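In case it helps, this is the shape of the method our cron calls (a sketch only; the method name _cron_check_stuck_batches is ours, not something queue_job_batch provides):

# Sketch of the model method our scheduled action calls.
# _cron_check_stuck_batches is our own name, not part of queue_job_batch.
from odoo import api, models


class QueueJobBatch(models.Model):
    _inherit = "queue.job.batch"

    @api.model
    def _cron_check_stuck_batches(self):
        # Re-run the state check on batches still flagged as "progress";
        # batches whose jobs are all done get unstuck.
        self.search([("state", "=", "progress")]).check_state()

Triggered from a standard ir.cron record; when it runs as a scheduled action the cron runner commits the transaction itself, so the explicit env.cr.commit() above is not needed in this form.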