okuna-api
supervisor program:rqworker dies and stalls processing jobs
Today we had a queue of 500+ jobs not being processed by rq workers.
The second machine was running several rqworker processes:
root 1104 0.0 1.9 395228 77472 ? S Aug15 0:21 python manage.py rqworker default --pid /var/run/rqworker
root 1756 0.0 1.9 395228 77580 ? S Aug14 0:22 python manage.py rqworker default --pid /var/run/rqworker
root 3195 0.0 1.9 395284 77416 ? S Aug23 0:09 python manage.py rqworker default --pid /var/run/rqworker
root 5229 0.0 1.9 395228 77676 ? S Aug14 0:22 python manage.py rqworker default --pid /var/run/rqworker
root 6642 0.0 1.9 395228 77496 ? S Aug14 0:24 python manage.py rqworker default --pid /var/run/rqworker
root 7178 0.0 1.9 395228 77504 ? S Aug15 0:21 python manage.py rqworker default --pid /var/run/rqworker
root 8669 0.0 1.9 395228 77632 ? S Aug14 0:21 python manage.py rqworker default --pid /var/run/rqworker
root 12302 0.0 1.9 395228 77704 ? S Aug14 0:24 python manage.py rqworker default --pid /var/run/rqworker
root 12460 0.0 1.9 395228 77616 ? S Aug15 0:20 python manage.py rqworker default --pid /var/run/rqworker
root 13779 0.0 0.0 115272 3200 ? S Aug28 0:00 /bin/bash -c source /opt/python/current/env && source /opt/p
But supervisor wasn't aware of a running program:rqworker. When I tried to stop the service, I got:
/usr/local/bin/supervisorctl -c /opt/python/etc/supervisord.conf -s unix:///opt/python/run/supervisor.sock stop rqworker
rqworker: ERROR (not running)
Once I started it
/usr/local/bin/supervisorctl -c /opt/python/etc/supervisord.conf -s unix:///opt/python/run/supervisor.sock start rqworker
Jobs started to process again.
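To catch this mismatch earlier, something like the following could compare what supervisor thinks it owns against what is actually running. This is only a rough sketch: it assumes the output of `supervisorctl status` and of `ps -eo pid,ppid,args` is passed in as text, and the function names `supervised_pids` and `orphaned_workers` are made up for illustration. It walks each rqworker's parent chain, since supervisor here starts a `/bin/bash -c` wrapper rather than the worker itself.

```python
import re


def supervised_pids(status_output):
    """Collect PIDs that `supervisorctl status` reports as RUNNING."""
    pids = set()
    for line in status_output.splitlines():
        match = re.search(r"\bRUNNING\s+pid\s+(\d+)", line)
        if match:
            pids.add(int(match.group(1)))
    return pids


def orphaned_workers(ps_output, status_output):
    """Return PIDs of rqworker processes with no supervised ancestor.

    `ps_output` is expected to come from `ps -eo pid,ppid,args`.
    """
    owned = supervised_pids(status_output)
    parent = {}
    commands = {}
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3 or not fields[0].isdigit() or not fields[1].isdigit():
            continue  # skip the header and malformed lines
        pid = int(fields[0])
        parent[pid] = int(fields[1])
        commands[pid] = fields[2]

    orphans = []
    for pid, command in commands.items():
        if "manage.py rqworker" not in command:
            continue
        # Walk up the process tree until we hit a supervised PID or run out.
        ancestor = pid
        seen = set()
        while ancestor in parent and ancestor not in owned and ancestor not in seen:
            seen.add(ancestor)
            ancestor = parent[ancestor]
        if ancestor not in owned:
            orphans.append(pid)
    return sorted(orphans)
```

If `orphaned_workers` returns anything while supervisor reports rqworker as not running, we are probably in the state described above.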
Perhaps related to https://github.com/rq/rq/issues/758
Is our supervisor config for rqworker correct?
We MUST ensure that this doesn't happen again as we will now use rq workers to process post media.
If they stall, no posting will be possible.
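As a safety net, a periodic check along these lines might at least alert us when the queue stops draining. It is a sketch under assumptions: the Redis URL, the `max_backlog` threshold and the helper names `looks_stalled` and `check_default_queue` are all hypothetical, and a backlog threshold is only a crude proxy for "stalled". The rq calls used (`Queue(...)`, `queue.count`, `Worker.all(connection=...)`) exist in rq, but note that `Worker.all` counts workers registered in Redis on any queue, not only `default`.

```python
def looks_stalled(queued_jobs, registered_workers, max_backlog=100):
    """Heuristic: jobs are waiting and nobody is working, or the backlog is too large.

    max_backlog is an arbitrary illustrative threshold, not a tuned value.
    """
    if queued_jobs > 0 and registered_workers == 0:
        return True
    return queued_jobs > max_backlog


def check_default_queue(redis_url="redis://localhost:6379/0"):
    """Query rq for the `default` queue and apply the heuristic above.

    redis_url is a placeholder; use the connection the app is configured with.
    """
    # Imported lazily so the heuristic can be used without rq installed.
    from redis import Redis
    from rq import Queue, Worker

    connection = Redis.from_url(redis_url)
    queue = Queue("default", connection=connection)
    workers = Worker.all(connection=connection)
    return looks_stalled(queue.count, len(workers))
```

With today's numbers, `looks_stalled(500, 0)` and `looks_stalled(500, 9)` would both have fired, so this would have flagged the incident whether or not the stray workers were still registered.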
@evict halp
Could it be that there was a redeploy? We have a script which, post deploy, restarts supervisord with the custom config, including the djangorq program.
Perhaps this doesn't get run in all situations?
files:
  "/opt/elasticbeanstalk/hooks/appdeploy/post/04_update_supervisor.sh":
    mode: "000755"
    owner: root
    group: root
    content: |
      #!/usr/bin/env bash
      /usr/local/bin/supervisorctl -c /opt/python/etc/supervisord.conf -s unix:///opt/python/run/supervisor.sock reload
Nope, there was no redeploy at that time. The last one was on the 28th of August. There are no errors in the logs whatsoever.