
Cluster unresponsive after worker is killed

Open · krid opened this issue 8 years ago · 7 comments

I start up a cluster, wait a minute, then kill the worker processes. New workers are incarnated, but the cluster refuses to process new jobs (though it processed them just fine before). When I then try to shut down the cluster it hangs after killing the non-reincarnated members:

$ ./manage.py qcluster
15:19:35 [Q] INFO Q Cluster-7631 starting.
15:19:35 [Q] INFO Process-1:1 ready for work at 7635
15:19:35 [Q] INFO Process-1:2 ready for work at 7636
15:19:35 [Q] INFO Process-1:3 monitoring at 7637
15:19:35 [Q] INFO Process-1 guarding cluster at 7634
15:19:35 [Q] INFO Process-1:4 pushing tasks at 7638
15:19:35 [Q] INFO Q Cluster-7631 running.

(At this point I kill 7635 & 7636 from another window)

15:19:51 [Q] ERROR reincarnated worker Process-1:1 after death
15:19:51 [Q] INFO Process-1:5 ready for work at 7651
15:19:52 [Q] ERROR reincarnated worker Process-1:2 after death
15:19:52 [Q] INFO Process-1:6 ready for work at 7652

(Jobs submitted after this point are ignored by the cluster)
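(For reference, "submitting a job" here just means the normal enqueue call. A hypothetical minimal version is shown below; on the 0.7.x release in use the function was still called async, it was renamed async_task in 1.0.)

from django_q.tasks import async_task, result

# Hypothetical illustration only, not the reporter's exact code.
task_id = async_task('math.sqrt', 16)   # task row is written to the ORM broker
print(result(task_id, wait=5000))       # stays None: no worker ever picks it up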

^C16:05:14 [Q] INFO Q Cluster-7631 stopping.
16:05:14 [Q] INFO Process-1 stopping cluster processes
16:05:14 [Q] INFO Process-1:4 stopped pushing tasks

(It's hung here, additional ctrl-C does nothing. Need to ctrl-Z and kill manually.)

^C16:05:17 [Q] INFO Q Cluster-7631 stopping.
^Z
[1]+  Stopped                 ./manage.py qcluster
(dash) $ kill %1
(dash) $ 16:05:25 [Q] INFO Q Cluster-7631 stopping.
16:05:25 [Q] INFO Q Cluster-7631 has stopped.
16:05:25 [Q] INFO Q Cluster-7631 has stopped.
16:05:25 [Q] INFO Q Cluster-7631 has stopped.

Configuration: Running the latest django-q from pip on Ubuntu 14.04, using the config below.

(dash) $ pip freeze -l
arrow==0.8.0
blessed==1.14.1
Django==1.9.1
django-debug-toolbar==1.3.2
django-mysql==1.0.1
django-picklefield==0.3.2
django-q==0.7.18
flufl.lock==2.4.1
future==0.15.2
ipython==3.1.0
mysqlclient==1.3.6
pyaml==15.3.1
python-dateutil==2.5.3
pytz==2015.2
PyYAML==3.11
requests==2.8.1
six==1.10.0
sqlparse==0.1.15
wcwidth==0.1.7
(dash) $ python -V
Python 3.4.3
(dash) $ uname -a
Linux orthrus 3.13.0-98-generic #145-Ubuntu SMP Sat Oct 8 20:13:07 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Q_CLUSTER = {
    'name': 'dash',
    'workers': 2,        # two worker processes
    'recycle': 1,        # recycle (restart) each worker after every task
    'timeout': 6000,     # seconds a task may run before it is killed
    'retry': 6060,       # seconds before an unacknowledged task is retried
    'compress': False,
    'orm': 'default',    # use the Django ORM (default database) as the broker
}
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.filebased.FileBasedCache',
        'LOCATION': '/tmp/django_cache',
    }
}

krid avatar Oct 20 '16 23:10 krid

+1 Got the same issue

frank-u avatar Oct 31 '16 16:10 frank-u

+1 Got the same issue. My findings after a reincarnation:

  • The "pusher" is still connected to the broker and thus still receives messages from Redis (or whichever broker is in use)
  • The "pusher" pushes messages into the multiprocessing queue without error
  • The newly reincarnated worker waits on the task_queue (multiprocessing.Queue) but never returns from task_queue.get

My workaround: I've added a scheduled task that pushes a metric (a kind of ping) to CloudWatch (I'm on AWS), and I trigger a restart of the whole cluster if no ping arrives for more than X minutes.
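A minimal sketch of that kind of heartbeat, assuming boto3 with working AWS credentials; the namespace, metric name and interval are made up for illustration and may differ from the commenter's actual setup:

# heartbeat.py (hypothetical module)
import boto3

def push_heartbeat():
    """Scheduled django-q task: push a 'cluster is alive' data point to CloudWatch."""
    boto3.client('cloudwatch').put_metric_data(
        Namespace='DjangoQ',
        MetricData=[{'MetricName': 'heartbeat', 'Value': 1.0, 'Unit': 'Count'}],
    )

# Registered once, e.g. from a shell or data migration:
#   from django_q.tasks import schedule
#   from django_q.models import Schedule
#   schedule('heartbeat.push_heartbeat', schedule_type=Schedule.MINUTES, minutes=5)

Because the heartbeat itself runs through the cluster, a wedged cluster stops emitting data points, and a CloudWatch alarm on missing data can then trigger the restart.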

gchardon-hiventy avatar Nov 08 '16 14:11 gchardon-hiventy

Hi I'm having the same problem.

I want to be able to terminate the cluster but it hangs after I press Control+C:

^C13:56:07 [Q] INFO Q Cluster-18641 stopping.
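A blunt workaround until the hang is fixed, assuming nothing else on the machine matches the pattern, is to force-kill the stuck cluster from another shell:

$ pkill -9 -f "manage.py qcluster"   # SIGKILL every process whose command line matches

SIGKILL skips the normal shutdown, so any in-flight task is left to the broker's retry/acknowledgement handling.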

raonyguimaraes avatar Feb 27 '18 13:02 raonyguimaraes

I am also seeing this same issue

kbuilds avatar Nov 01 '18 19:11 kbuilds

+1 Got the same issue

aliensowo avatar Jun 14 '22 08:06 aliensowo

I think this error points to a mistake in our own code; maybe one of the variables or arguments is incorrect.

asifaliek avatar Jun 24 '22 11:06 asifaliek

Same issue for us as well. Any news on this?

stelios-gasparinatos avatar Dec 01 '22 14:12 stelios-gasparinatos