django-q2 icon indicating copy to clipboard operation
django-q2 copied to clipboard

Fix repeating task after timeout

Open GDay opened this issue 9 months ago • 4 comments

Timeouts that happened with brokers that need acknowledging could have jobs that would keep retrying. This was caused due to the process being killed without sending the job back to the monitor function that would mark the job as failed and would pull the task from the broker. This wasn't an issue with brokers that automatically remove the task from the broker when it gets pulled.

I have implemented a context manager that will keep track of the time in the worker itself (so not by the guard function). It will end the running task and return a timeout error. That gets caught and then it gets pushed to the monitor to process. It will also set the timeout to 0, so the guard can clean the process up and reincarnate a process (so we get a clean state for the next task).

GDay avatar May 05 '24 00:05 GDay

Pull Request Test Coverage Report for Build 9636221326

Details

  • 33 of 41 (80.49%) changed or added relevant lines in 3 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.05%) to 89.683%

Changes Missing Coverage Covered Lines Changed/Added Lines %
django_q/worker.py 10 13 76.92%
django_q/timeout.py 21 26 80.77%
<!-- Total: 33 41
Files with Coverage Reduction New Missed Lines %
django_q/cluster.py 1 91.3%
<!-- Total: 1
Totals Coverage Status
Change from base Build 9039900731: -0.05%
Covered Lines: 3364
Relevant Lines: 3751

💛 - Coveralls

coveralls avatar May 05 '24 00:05 coveralls

@GDay I think there could ba an issue with that implementation, according to python docs signal.SIGALRM availability is Unix only.

Also, there is the risk that the task itself redefines SIGALRM which would then interfere with the mechanism. May be this can be worked out by just documenting it.

msabatier avatar May 06 '24 11:05 msabatier

Hmm.. thanks for the heads-up @msabatier. You are correct.

The alternative, spinning off a thread with a timer, is not something I am keen to implement for multiple reasons.

So, this is interesting, looking at how other background workers do this: Huey doesn't support timeouts at all: https://github.com/coleifer/huey/issues/159 Django-rq doesn't support Windows Celery doesn't support timeout in Windows: https://docs.celeryq.dev/en/latest/userguide/workers.html#time-limits

The only other option I see is making the workers class based and then having the guard (who kills the process) send the task to the monitor right before it kills the process. I will give that a go and see if that works out soon.

GDay avatar May 06 '24 15:05 GDay

May be you could try to intercept the normal SIGTERM in the worker and do a clean shutdown there ? One issue is that process.terminate() does not send SIGTERM on Windows, may be os.kill() will do the job.

msabatier avatar May 06 '24 17:05 msabatier

Getting this one in now, so we at least have something for catching these in Unix systems. For windows users, this issue will not be solved, but that's likely a minority of users right now.

Getting tracked here: https://github.com/django-q2/django-q2/issues/198

GDay avatar Jun 23 '24 20:06 GDay