
[18.0] Job batches stuck in In Progress

Open stamtos opened this issue 4 months ago • 5 comments

Module

queue_job_batch

Describe the bug

Job batches stuck in In Progress even when all are finished.

To Reproduce

Odoo 18.0 Enterprise

Steps to reproduce the behavior:

  1. Create a few job batches, each containing more than one job.
  2. Wait a few hours; some batches will randomly get stuck in "In Progress".

Expected behavior

The batch should be marked as finished when all of its jobs are finished.

Additional context

[two screenshots attached]

stamtos avatar Aug 05 '25 23:08 stamtos

Can you provide additional details, such as logs? I have encountered something like this in the past when my jobs took longer than the --limit-time-* settings allowed.
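For reference, those limits live in odoo.conf (or the matching CLI flags); a minimal sketch, with placeholder values that should be tuned to the longest expected job:

[options]
; worker time limits in seconds; a job that exceeds them is killed mid-run,
; which can leave its batch looking unfinished
limit_time_cpu = 600
limit_time_real = 1200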

amh-mw avatar Aug 08 '25 12:08 amh-mw

Hi @amh-mw, thank you for your response. I will increase the limit-time-* values and see if this helps.

I looked at the jobs in another batch, and all of them finished in the "done" state while taking less than the default limit-time-cpu value of 60 seconds. Additionally, the total time for all jobs was less than 60 seconds. So I do not think it is because of a timeout.

I do not see any warning or error logs.

Here is my job pseudocode:

import logging

import requests

from odoo.addons.queue_job.exception import RetryableJobError

_logger = logging.getLogger(__name__)


def process(self, param):
    _logger.info(f"Processing {len(self)} items with parameter {param}")
    url = self.env['ir.config_parameter'].sudo().get_param('service.url')
    api_key = self.env['ir.config_parameter'].sudo().get_param('service.key')
    for rec in self:
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        }
        input_text = rec.name
        data = {"input": input_text}

        try:
            # Explicit timeout so a hung request cannot block the worker indefinitely
            response = requests.post(url, headers=headers, json=data, timeout=30)
            response.raise_for_status()
            _logger.info(f"Response: {response.json()}")
            output = response.json().get("result", {})

            # Fall back to the record's current values if the service omits a key
            value1 = output.get("key1", rec.value1)
            value2 = output.get("key2", rec.value2)

            data_dict = {
                'key1': value1,
                'key2': value2,
            }

            rec.update_values('field', data_dict)

        except Exception as e:
            # Catch Exception (not BaseException) and re-raise as retryable,
            # so queue_job schedules another attempt instead of failing hard.
            _logger.error(f"Failed for input '{input_text}': {e}")
            raise RetryableJobError(f"Failed for input '{input_text}': {e}") from e

And here is the batch method:

def start(self):
    batch = self.env['queue.job.batch'].sudo().get_new_batch('Batch Process')  
    batch_size = 1  # Adjust as needed
    for i in range(0, len(self), batch_size):
        subset = self[i:i + batch_size]
        subset.sudo().with_context(job_batch=batch).with_delay(
            channel='root.process',
            description='Process subset',  
        ).process('param_value') 
    return True

PS: I also tried adjusting batch_size, from 10 down to 1.

stamtos avatar Aug 20 '25 03:08 stamtos

I found the root cause of this issue: a race condition in the batch state recompute when multiple jobs finish at the same time (and at least two channels are in use). This leaves some batches stuck in "In Progress" even though all of their jobs are "Done".

To validate my hypothesis, I will implement some fixes in my codebase. If testing in my deployment environment confirms that they really fix the problem, I will open a pull request with the changes.

https://github.com/OCA/queue/blob/2e0ed2cc9022eecc517315308ce25c87e0c214ec/queue_job_batch/models/queue_job.py#L22-L32
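For illustration only, here is a rough sketch of the kind of mitigation such a race condition usually calls for: taking a row-level lock on the batch before recomputing its state, so two jobs finishing at the same moment cannot both read a stale snapshot. The model and method names (queue.job.batch, check_state) come from this thread; the hook name, the job_batch_id field, and the table name are assumptions, and this is not the module's actual code nor necessarily what the eventual PR will do.

# Sketch only, not the module's code. Assumes queue.job records carry a
# job_batch_id field and that queue.job.batch exposes check_state().
def _on_job_done(self):  # hypothetical hook invoked when a job reaches "done"
    for batch in self.mapped("job_batch_id"):
        # FOR UPDATE serializes concurrent transactions on the same batch row,
        # so the last job to finish sees an up-to-date view of its siblings.
        self.env.cr.execute(
            "SELECT id FROM queue_job_batch WHERE id = %s FOR UPDATE",
            (batch.id,),
        )
        batch.check_state()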

[screenshot attached]

Another example:

[screenshot attached]

stamtos avatar Aug 20 '25 05:08 stamtos

Hi @stamtos, any news on that one? We face the same problem here; if your code solves the issue, we would be interested to know / to get a PR for that!

Thanks !

remi-filament avatar Sep 30 '25 10:09 remi-filament

@remi-filament, due to other work priorities, I haven't tested anything yet. As a workaround, we have a cron running check_state() periodically:

env['queue.job.batch'].search([('state', '=', 'progress')]).check_state()
env.cr.commit()

Hope it helps in the meantime.
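For anyone who prefers pointing the scheduled action at a model method instead of raw server-action code, a minimal sketch (the method name is hypothetical; check_state and the 'progress' state come from the snippet above):

from odoo import api, models


class QueueJobBatch(models.Model):
    _inherit = "queue.job.batch"

    @api.model
    def cron_recheck_stuck_batches(self):
        # Re-evaluate every batch still reported as "In Progress";
        # check_state() moves it along once all of its jobs are done.
        self.search([("state", "=", "progress")]).check_state()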

stamtos avatar Oct 03 '25 10:10 stamtos