
cli requeue can fail if jobs reach TTL

Open AntiSol opened this issue 2 years ago • 2 comments

Hello

At line 115 of cli.py, you are trying to requeue jobs in the failed registry, and catching exceptions when that fails.

I ran into a situation where I had a large number of jobs in the failed registry and I tried to requeue them via the CLI, and I hit this error:

$ rq requeue --queue my_queue --all
Requeueing 344202 jobs from failed queue
Traceback (most recent call last):
  File "/usr/local/bin/rq", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/rq/cli/cli.py", line 83, in wrapper
    return ctx.invoke(func, cli_config, *args[1:], **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/rq/cli/cli.py", line 146, in requeue
    failed_job_registry.requeue(job_id)
  File "/usr/local/lib/python3.9/site-packages/rq/registry.py", line 120, in requeue
    job = self.job_class.fetch(job_or_id, connection=self.connection, serializer=serializer)
  File "/usr/local/lib/python3.9/site-packages/rq/job.py", line 372, in fetch
    job.refresh()
  File "/usr/local/lib/python3.9/site-packages/rq/job.py", line 617, in refresh
    raise NoSuchJobError('No such job: {0}'.format(self.key))
rq.exceptions.NoSuchJobError: No such job: b'rq:job:160f28f8-e6f3-4d93-8daf-c06475772d47'

I think what's probably happening here is that one or more failed jobs hit their expiry time and were removed by Redis in the time between getting the list of >300K job IDs and trying to re-queue them, so this particular ID didn't exist when the CLI tool got around to trying to re-queue it.

Ideally, in this situation the CLI tool should continue re-queueing the other failed jobs rather than throwing this error.

I was able to work around it in my case by editing line 115 of cli.py to simply catch any exception, and the cli utility was able to re-queue all the jobs that didn't expire. I expect a better fix than my quick hack would be to modify this line to also catch a NoSuchJobError :)
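
A minimal sketch of that idea, not the actual code in cli.py: the loop below catches `NoSuchJobError` per job and keeps going. `requeue_all` and `FakeRegistry` are hypothetical names made up for illustration; the only things taken from the traceback above are `rq.exceptions.NoSuchJobError` and the registry's `requeue(job_id)` call. The import falls back to a local stand-in exception so the snippet runs without rq or Redis.

```python
try:
    from rq.exceptions import NoSuchJobError
except ImportError:
    # Stand-in so this sketch runs without rq installed (assumption for the demo).
    class NoSuchJobError(Exception):
        pass


def requeue_all(registry, job_ids):
    """Requeue each job ID, skipping jobs whose data no longer exists.

    Returns (requeued, missing) so the caller can report both counts.
    """
    requeued, missing = [], []
    for job_id in job_ids:
        try:
            registry.requeue(job_id)
        except NoSuchJobError:
            # The job expired between listing the IDs and requeueing it.
            missing.append(job_id)
        else:
            requeued.append(job_id)
    return requeued, missing


class FakeRegistry:
    """Hypothetical in-memory stand-in for a failed-job registry."""

    def __init__(self, live_ids):
        self.live_ids = set(live_ids)

    def requeue(self, job_id):
        # Mimic the failure seen in the traceback for IDs that have expired.
        if job_id not in self.live_ids:
            raise NoSuchJobError('No such job: {0}'.format(job_id))
        return job_id


if __name__ == '__main__':
    registry = FakeRegistry(['a', 'c'])
    requeued, missing = requeue_all(registry, ['a', 'b', 'c'])
    print('Requeued {0} jobs, skipped {1} missing jobs'.format(len(requeued), len(missing)))
```

Collecting the missing IDs instead of silently swallowing them lets the tool print a summary at the end, which is probably nicer than my catch-everything hack.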

Thanks for your time.

AntiSol, Aug 10 '23 09:08

Hi @AntiSol

If I may, I have a question about this part of the report:

> I think what's probably happening here is that one or more failed jobs hit their expiry time and were removed by Redis in the time between getting the list of >300K job IDs and trying to re-queue them, so this particular ID didn't exist when the CLI tool got around to trying to re-queue it.

The default TTL for failed jobs is 1 year (doc), and I'm experiencing similar behaviour ("no such job" when reading a registry) while the jobs I have created are clearly below the TTL (they're a day old). Do you know if you actually modified the TTL for the jobs you were talking about, or do we have some common behavior where failed jobs kind of disappear while they're still in the registry?

tchapi, Sep 18 '23 15:09

Hi @tchapi. I'm talking about a fairly large system where we set custom timeouts for pretty much everything; our TTL is definitely not the default (we have it set to ~2 days iirc). I'm not aware of any jobs disappearing before hitting their TTL, sorry.

AntiSol, Sep 19 '23 01:09