django-q2
When the next-in-line worker dies, the entire cluster stops processing tasks.
Reference to issue: Original issue 394
My single-worker cluster still stops processing tasks after a timeout / reincarnation. I hope you can find the time to look into this fault.
Here is an example:
01:51:13 [Q] INFO Process-1:1 processing [july-fix-arizona-leopard]
02:01:25 [Q] INFO Processed [july-fix-arizona-leopard]
02:20:54 [Q] INFO Enqueued 33803
02:20:54 [Q] INFO Process-1 created a task from schedule [xx]
02:20:54 [Q] INFO Process-1:1 processing [salami-eight-butter-paris]
02:23:30 [Q] INFO Processed [salami-eight-butter-paris]
[2022-10-15 02:23:57 +0200] [6] [CRITICAL] WORKER TIMEOUT (pid:10)
[2022-10-15 02:23:58 +0200] [6] [WARNING] Worker with pid 10 was terminated due to signal 9
[2022-10-15 02:23:58 +0200] [26] [INFO] Booting worker with pid: 26
02:30:41 [Q] ERROR reincarnated worker Process-1:1 after death
02:30:41 [Q] INFO Process-1:4 ready for work at 28
02:31:12 [Q] INFO Enqueued 33804
02:31:12 [Q] INFO Process-1 created a task from schedule [xx]
02:58:18 [Q] INFO Enqueued 33805
02:58:18 [Q] INFO Process-1 created a task from schedule [xx]
03:00:19 [Q] INFO Enqueued 33806
03:00:19 [Q] INFO Process-1 created a task from schedule [xx]
03:25:54 [Q] INFO Enqueued 33807
After the reincarnation of the single worker, no more tasks are processed.
Thanks. I will take a look at this later this week.
So, the easiest solution would be to unlock it using either:
self.result_queue._rlock.release()
or
self.task_queue._rlock.release()
depending on which queue is locked. However, I am a bit hesitant, as it might have side effects with multiple workers. The queue is not intended to be locked/unlocked manually - it's an internal function. Restarting the whole thing is also not something I like, as it might kill running tasks unexpectedly.
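For illustration only, a minimal sketch of that unlock workaround, assuming the cluster knows which queue the dead worker was reading from; force_release_reader_lock is a hypothetical helper, not django-q2 API, and _rlock is a CPython internal:

```python
# Hypothetical helper (not django-q2 API) sketching the manual unlock.
# _rlock is a CPython-internal attribute of multiprocessing.Queue, so this
# is exactly the kind of manual locking warned about above and may have
# side effects with multiple workers.
from multiprocessing.queues import Queue


def force_release_reader_lock(q: Queue) -> bool:
    """Release q's internal reader lock if a dead worker left it held.

    Returns True if a release happened; returns False if the lock was not
    held (over-releasing a multiprocessing Lock raises ValueError on most
    platforms).
    """
    try:
        q._rlock.release()
        return True
    except ValueError:
        return False
```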
I might have to write a custom queue to get around this.
Hey, first of all, thanks for creating the fork of django-q and looking into this issue!
We are experiencing the same bug where the cluster stops processing queued tasks after reincarnating workers. We have not yet figured out why the workers die and get reincarnated. Do you know what could cause that? We cannot see any logs indicating errors or timeouts, and the tasks appear as successful in the database.
ERROR reincarnated worker Process-1:1 after death
We are currently setting recycle=1 so that a worker gets recycled immediately after processing a task. Since that change we have not run into the issue of tasks piling up, but that only serves as a hotfix.
Our settings (sketched below as a Q_CLUSTER dict):
- Broker: ORM
- Clusters: 2
- Workers: 4
- Recycle: 500
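A hedged sketch of what that configuration might look like in Django settings; the key names follow the Q_CLUSTER settings dict, the values mirror the list above, and the cluster name is just an assumption:

```python
# Illustrative Django settings for the setup listed above; the cluster name
# is an assumption, the other values mirror the list (the hotfix mentioned
# earlier drops recycle down to 1).
Q_CLUSTER = {
    "name": "default",   # assumed
    "orm": "default",    # ORM broker backed by the 'default' database
    "workers": 4,
    "recycle": 500,      # temporarily set to 1 as the workaround
}
```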
Why that process got killed is hard to find out; it could have been killed by a different process.
It needs a pretty big update to fix: the worker, pusher and monitor all share queues. There are currently two queues: result_queue and task_queue. Queues get locked by the processes and then released again later. The issue is that when a process dies unexpectedly, the queue is never released and therefore stops functioning, even if the process gets restarted.
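To make the failure mode concrete, here is a standalone reproduction sketch with plain multiprocessing (not django-q2 code); it assumes POSIX semantics, where a SIGKILLed process never releases the semaphore behind the queue's internal _rlock:

```python
# Standalone reproduction sketch (plain multiprocessing, not django-q2 code).
# A multiprocessing.Queue guards reads with an internal lock (_rlock). If a
# process is SIGKILLed while holding that lock, nothing ever releases it and
# every later get() blocks or times out, even though items are waiting.
# Behaviour shown here is POSIX-specific and relies on CPython internals.
import multiprocessing as mp
import os
import queue
import signal


def hold_lock_and_die(q) -> None:
    # Grab the queue's internal reader lock, then die without releasing it,
    # simulating a worker that gets killed mid-read.
    q._rlock.acquire()
    os.kill(os.getpid(), signal.SIGKILL)


if __name__ == "__main__":
    ctx = mp.get_context("fork")
    q = ctx.Queue()
    q.put("task-1")

    p = ctx.Process(target=hold_lock_and_die, args=(q,))
    p.start()
    p.join()
    print("worker exit code:", p.exitcode)  # -9: killed by SIGKILL

    try:
        # The item is still in the queue, but the reader lock is held by the
        # dead process, so this can never succeed.
        print(q.get(timeout=2))
    except queue.Empty:
        print("queue is stuck: the reader lock was never released")
```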
The simple fix is to unlock the queue when a process dies (explained in an earlier comment); however, this could have some side effects.
The real fix would probably be not to rely on a queue that is shared between processes, but to use pipes instead to move data from one process to another. If a pipe/process dies, it doesn't affect the other running processes.
Django-q(2) has been built around queues, so this is not something quick to fix. I will have to find some time to test that out.
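As a rough illustration of the pipe-per-worker direction (assumptions only, not the actual rewrite in the draft PR): each worker owns its own Pipe, so a dead worker only breaks its own connection and can be replaced without touching the other workers' channels.

```python
# Rough sketch of a pipe-per-worker layout (assumptions only, not the code
# in the draft PR). Each worker owns a private Pipe, so a dying worker only
# breaks its own connection and can be reincarnated with a fresh pipe
# without affecting any other worker.
import multiprocessing as mp


def worker(conn) -> None:
    # Receive tasks over this worker's private pipe and send results back.
    while True:
        task = conn.recv()
        if task is None:  # sentinel: shut down cleanly
            break
        conn.send(f"done: {task}")


def spawn_worker(ctx):
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=worker, args=(child_conn,))
    proc.start()
    child_conn.close()  # the parent keeps only its own end
    return proc, parent_conn


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    proc, conn = spawn_worker(ctx)

    for task in ["a", "b", "c"]:
        try:
            conn.send(task)
            print(conn.recv())
        except (BrokenPipeError, EOFError, OSError):
            # Only this worker's pipe is broken: reincarnate it with a new
            # pipe and retry; other workers would be unaffected.
            proc, conn = spawn_worker(ctx)
            conn.send(task)
            print(conn.recv())

    conn.send(None)  # ask the worker to exit
    proc.join()
```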
FWIW: I didn't forget about this issue - I have fixed it (but it will take a bit more time before I can push it to the library). There is currently a draft PR here: https://github.com/django-q2/django-q2/pull/78 which will resolve this issue. It's pretty much a full rewrite of the library, going for a more class-based approach, replacing queues with pipes between processes (which is what actually solves this issue), and a lot more.
Hi - checking in on any update for this issue? Thanks for the last update and all your support @GDay !
@johnwhelchel I have a little too much on my plate right now with freelance work. I will get back to this, but it will likely take a few more weeks before I have some spare time again.