django-q2
Fix repeating task after timeout
Timeouts with brokers that require acknowledgement could leave jobs retrying indefinitely. This happened because the process was killed without sending the job back to the monitor function, which would otherwise mark the job as failed and pull the task from the broker. This wasn't an issue with brokers that automatically remove a task from the queue when it gets pulled.
I have implemented a context manager that keeps track of the time in the worker itself (not in the guard function). It ends the running task and raises a timeout error, which gets caught and pushed to the monitor to process. It also sets the timeout to 0, so the guard can clean up the process and reincarnate it (giving us a clean state for the next task).
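The general shape of such a timeout context manager, sketched below with `signal.SIGALRM`, shows how the worker can regain control when a task overruns. The names `task_timeout` and `TimeoutException` are illustrative, not the actual identifiers in this PR:

```python
import signal
from contextlib import contextmanager


class TimeoutException(Exception):
    """Raised in the worker when a task exceeds its time limit."""


@contextmanager
def task_timeout(seconds):
    """Interrupt the current process after `seconds` via SIGALRM (Unix only)."""
    def _handler(signum, frame):
        raise TimeoutException(f"Task exceeded {seconds}s time limit")

    previous = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)  # schedule the alarm
    try:
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
```

The worker would wrap task execution in `with task_timeout(limit): ...`, catch the exception, and push the failed result to the monitor before the guard reincarnates the process.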
Pull Request Test Coverage Report for Build 9636221326
Details
- 33 of 41 (80.49%) changed or added relevant lines in 3 files are covered.
- 1 unchanged line in 1 file lost coverage.
- Overall coverage decreased (-0.05%) to 89.683%
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| django_q/worker.py | 10 | 13 | 76.92% |
| django_q/timeout.py | 21 | 26 | 80.77% |
| Total: | 33 | 41 | |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| django_q/cluster.py | 1 | 91.3% |
| Total: | 1 | |

| Totals | |
|---|---|
| Change from base Build 9039900731: | -0.05% |
| Covered Lines: | 3364 |
| Relevant Lines: | 3751 |
💛 - Coveralls
@GDay I think there could be an issue with that implementation: according to the Python docs, signal.SIGALRM is only available on Unix.
Also, there is the risk that the task itself redefines the SIGALRM handler, which would interfere with this mechanism. Maybe that can be worked around by just documenting it.
Hmm.. thanks for the heads-up @msabatier. You are correct.
The alternative, spinning off a thread with a timer, is not something I am keen to implement, for multiple reasons.
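For context, the thread-with-a-timer alternative would look roughly like the sketch below (my own illustration, not code from this PR). It hints at why the approach is unappealing: the interrupt is delivered via a simulated SIGINT, only works from the main thread, and cannot break out of code blocked in C extensions:

```python
import _thread
import threading
from contextlib import contextmanager


@contextmanager
def thread_timeout(seconds):
    """Timeout via a timer thread that interrupts the main thread.

    Caveats: main-thread only, cannot interrupt blocking C calls, and a
    real Ctrl-C during the block would be misreported as a timeout.
    """
    timer = threading.Timer(seconds, _thread.interrupt_main)
    timer.start()
    try:
        yield
    except KeyboardInterrupt:
        # interrupt_main surfaces as KeyboardInterrupt; translate it
        raise TimeoutError(f"Task exceeded {seconds}s time limit")
    finally:
        timer.cancel()  # stop the timer if the task finished in time
```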
So, this is interesting. Looking at how other background workers handle this:
- Huey doesn't support timeouts at all: https://github.com/coleifer/huey/issues/159
- Django-rq doesn't support Windows
- Celery doesn't support time limits on Windows: https://docs.celeryq.dev/en/latest/userguide/workers.html#time-limits
The only other option I see is making the workers class-based and having the guard (which kills the process) send the task to the monitor right before it kills the process. I will give that a go soon and see if it works out.
Maybe you could try to intercept the normal SIGTERM in the worker and do a clean shutdown there? One issue is that process.terminate() does not send SIGTERM on Windows; maybe os.kill() will do the job.
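A minimal sketch of that idea, assuming a hypothetical `install_clean_shutdown` helper (not part of django-q2): the worker registers a SIGTERM handler that reports the in-flight task to the monitor before exiting. Note that on Windows, multiprocessing's `Process.terminate()` calls `TerminateProcess`, which cannot be caught, so this only helps on Unix:

```python
import signal
import sys


def install_clean_shutdown(on_shutdown):
    """Register a SIGTERM handler that runs `on_shutdown` (e.g. push the
    current task back to the monitor) and then exits the worker.

    Unix only in practice: Windows has no catchable SIGTERM delivery.
    """
    def _handler(signum, frame):
        on_shutdown()
        sys.exit(0)  # raises SystemExit so cleanup can propagate

    signal.signal(signal.SIGTERM, _handler)
```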
Getting this one in now, so we at least have something for catching these on Unix systems. For Windows users this issue will not be solved, but that's likely a minority of users right now.
Getting tracked here: https://github.com/django-q2/django-q2/issues/198