dramatiq icon indicating copy to clipboard operation
dramatiq copied to clipboard

Worker restarts automatically

Open tlmn-local opened this issue 2 years ago • 6 comments

Issues

I have created workers on two servers and according to the logs, the workers on both servers are restarting automatically at the same time, I don't understand why this is.

logs:

21:48:01,102 [ALERT] export:64 - The subdomain result for jd.co.th: /root/Gleam/plugins/domain/oneforall/results/jd.co.th.csv
21:48:01,102 [INFOR] oneforall:258 - Finished OneForAll
[2022-01-22 21:48:01,149] [PID 16211] [Thread-8] [tasks.normal.domain.oneforall_domain_task.oneforall_task] [DEBUG] Completed after 38457.04ms.
[2022-01-22 21:48:01,149] [PID 16211] [Thread-8] [dramatiq.worker.ConsumerThread(domain)] [DEBUG] Acknowledging message '1ae067af-1bf7-4c30-99fb-4499bf8720b0'.
[2022-01-22 22:10:50,444] [PID 16210] [MainThread] [dramatiq.WorkerProcess(0)] [INFO] Stopping worker process...
[2022-01-22 22:10:50,443] [PID 16211] [MainThread] [dramatiq.WorkerProcess(1)] [INFO] Stopping worker process...
[2022-01-22 22:10:50,444] [PID 16244] [MainThread] [dramatiq.ForkProcess(0)] [INFO] Stopping fork process...
[2022-01-22 22:10:50,446] [PID 16209] [MainThread] [dramatiq.MainProcess] [INFO] Sending signal 1 to subprocesses...
[2022-01-22 22:10:50,446] [PID 16210] [MainThread] [dramatiq.WorkerProcess(0)] [WARNING] Killing worker process...
[2022-01-22 22:10:50,446] [PID 16211] [MainThread] [dramatiq.WorkerProcess(1)] [WARNING] Killing worker process...
[2022-01-22 22:10:52,059] [PID 16209] [MainThread] [dramatiq.MainProcess] [INFO] Dramatiq '1.12.2' is booting up.
[2022-01-22 22:10:51,524] [PID 22623] [MainThread] [dramatiq.WorkerProcess(0)] [DEBUG] Loading broker...
[2022-01-22 22:10:51,983] [PID 22623] [MainThread] [dramatiq.middleware.prometheus.Prometheus] [DEBUG] Setting up metrics...
[2022-01-22 22:10:51,983] [PID 22623] [MainThread] [dramatiq.WorkerProcess(0)] [DEBUG] Loading modules...
[2022-01-22 22:10:52,054] [PID 22623] [MainThread] [dramatiq.WorkerProcess(0)] [DEBUG] Sending forks to main process...
[2022-01-22 22:10:52,055] [PID 22623] [MainThread] [dramatiq.WorkerProcess(0)] [DEBUG] Starting worker threads...
[2022-01-22 22:10:52,055] [PID 22623] [MainThread] [dramatiq.worker._WorkerMiddleware] [DEBUG] Adding consumer for queue 'domain'.
[2022-01-22 22:10:52,055] [PID 22623] [Thread-3] [dramatiq.worker.ConsumerThread(domain)] [DEBUG] Running consumer thread...
[2022-01-22 22:10:52,055] [PID 22623] [MainThread] [dramatiq.worker._WorkerMiddleware] [DEBUG] Adding consumer for delay queue 'domain.DQ'.
[2022-01-22 22:10:52,055] [PID 22623] [Thread-4] [dramatiq.worker.ConsumerThread(domain.DQ)] [DEBUG] Running consumer thread...
[2022-01-22 22:10:52,056] [PID 22623] [Thread-5] [dramatiq.worker.WorkerThread] [DEBUG] Running worker thread...
[2022-01-22 22:10:52,056] [PID 22623] [Thread-6] [dramatiq.worker.WorkerThread] [DEBUG] Running worker thread...
[2022-01-22 22:10:52,056] [PID 22623] [Thread-7] [dramatiq.worker.WorkerThread] [DEBUG] Running worker thread...
[2022-01-22 22:10:52,056] [PID 22623] [Thread-8] [dramatiq.worker.WorkerThread] [DEBUG] Running worker thread...
[2022-01-22 22:10:52,056] [PID 22623] [Thread-9] [dramatiq.worker.WorkerThread] [DEBUG] Running worker thread...
[2022-01-22 22:10:52,057] [PID 22623] [Thread-10] [dramatiq.worker.WorkerThread] [DEBUG] Running worker thread...
[2022-01-22 22:10:52,057] [PID 22623] [Thread-11] [dramatiq.worker.WorkerThread] [DEBUG] Running worker thread...
[2022-01-22 22:10:52,057] [PID 22623] [Thread-12] [dramatiq.worker.WorkerThread] [DEBUG] Running worker thread...
[2022-01-22 22:10:52,057] [PID 22623] [MainThread] [dramatiq.WorkerProcess(0)] [INFO] Worker process is ready for action.
[2022-01-22 22:10:51,526] [PID 22624] [MainThread] [dramatiq.WorkerProcess(1)] [DEBUG] Loading broker...
[2022-01-22 22:10:51,983] [PID 22624] [MainThread] [dramatiq.middleware.prometheus.Prometheus] [DEBUG] Setting up metrics...
[2022-01-22 22:10:51,983] [PID 22624] [MainThread] [dramatiq.WorkerProcess(1)] [DEBUG] Loading modules...
[2022-01-22 22:10:52,055] [PID 22624] [MainThread] [dramatiq.WorkerProcess(1)] [DEBUG] Starting worker threads...
[2022-01-22 22:10:52,055] [PID 22624] [MainThread] [dramatiq.worker._WorkerMiddleware] [DEBUG] Adding consumer for queue 'domain'.
[2022-01-22 22:10:52,055] [PID 22624] [Thread-3] [dramatiq.worker.ConsumerThread(domain)] [DEBUG] Running consumer thread...
[2022-01-22 22:10:52,055] [PID 22624] [MainThread] [dramatiq.worker._WorkerMiddleware] [DEBUG] Adding consumer for delay queue 'domain.DQ'.
[2022-01-22 22:10:52,055] [PID 22624] [Thread-4] [dramatiq.worker.ConsumerThread(domain.DQ)] [DEBUG] Running consumer thread...
[2022-01-22 22:10:52,056] [PID 22624] [Thread-5] [dramatiq.worker.WorkerThread] [DEBUG] Running worker thread...
[2022-01-22 22:10:52,056] [PID 22624] [Thread-6] [dramatiq.worker.WorkerThread] [DEBUG] Running worker thread...
[2022-01-22 22:10:52,056] [PID 22624] [Thread-7] [dramatiq.worker.WorkerThread] [DEBUG] Running worker thread...
[2022-01-22 22:10:52,056] [PID 22624] [Thread-8] [dramatiq.worker.WorkerThread] [DEBUG] Running worker thread...
[2022-01-22 22:10:52,057] [PID 22624] [Thread-9] [dramatiq.worker.WorkerThread] [DEBUG] Running worker thread...
[2022-01-22 22:10:52,057] [PID 22624] [Thread-10] [dramatiq.worker.WorkerThread] [DEBUG] Running worker thread...
[2022-01-22 22:10:52,057] [PID 22624] [Thread-11] [dramatiq.worker.WorkerThread] [DEBUG] Running worker thread...
[2022-01-22 22:10:52,057] [PID 22624] [Thread-12] [dramatiq.worker.WorkerThread] [DEBUG] Running worker thread...
[2022-01-22 22:10:52,057] [PID 22624] [MainThread] [dramatiq.WorkerProcess(1)] [INFO] Worker process is ready for action.
[2022-01-22 22:10:52,059] [PID 22657] [MainThread] [dramatiq.ForkProcess(0)] [DEBUG] Loading fork function...
[2022-01-22 22:10:52,059] [PID 22657] [MainThread] [dramatiq.ForkProcess(0)] [INFO] Fork process 'dramatiq.middleware.prometheus:_run_exposition_server' is ready for action.
[2022-01-22 22:10:52,060] [PID 22657] [MainThread] [dramatiq.middleware.prometheus._run_exposition_server] [DEBUG] Starting exposition server...
[2022-01-22 22:52:05,127] [PID 22623] [Thread-4] [dramatiq.worker.ConsumerThread(domain.DQ)] [DEBUG] Pushing message 'c7b6dcac-a064-4532-8777-f61532e32a49' onto delay queue.
[2022-01-22 22:52:05,127] [PID 22623] [Thread-4] [dramatiq.worker.ConsumerThread(domain.DQ)] [DEBUG] Pushing message 'fb4aad49-ae9f-41ce-a330-2fb028852226' onto delay queue.
[2022-01-22 22:52:05,127] [PID 22623] [Thread-4] [dramatiq.worker.ConsumerThread(domain.DQ)] [DEBUG] Pushing message 'dbfb8b8e-2939-4eaf-9c74-e96fa8204e1e' onto delay queue.
[2022-01-22 22:52:05,128] [PID 22623] [Thread-4] [dramatiq.worker.ConsumerThread(domain.DQ)] [DEBUG] Pushing message 'c20d2b37-3955-4ab0-b6e6-5da741779b3f' onto delay queue.
[2022-01-22 22:52:05,128] [PID 22623] [Thread-4] [dramatiq.worker.ConsumerThread(domain.DQ)] [DEBUG] Pushing message '855b2581-e623-40b2-827e-0eefdfb23ce0' onto delay queue.

As you can see, the worker that was still executing the task at 21 was automatically restarted at 22.

tlmn-local avatar Jan 22 '22 17:01 tlmn-local

After debugging, I found that the following method is automatically called:

def termhandler(signum, frame):
    nonlocal running
    if running:
        logger.info("Stopping worker process...")
        running = False
    else:
        logger.warning("Killing worker process...")
        return sys.exit(RET_KILLED)

logger.info("Worker process is ready for action.")
signal.signal(signal.SIGINT, signal.SIG_IGN)
signal.signal(signal.SIGTERM, termhandler)
if hasattr(signal, "SIGHUP"):
    signal.signal(signal.SIGHUP, termhandler)
if hasattr(signal, "SIGBREAK"):
    signal.signal(signal.SIGBREAK, termhandler)

Is there a possibility that these signals are generated while the program is running? non-artificial behavior.

tlmn-local avatar Jan 27 '22 18:01 tlmn-local

I'm experiencing same problem. I'm running dramatiq-gevent (1.10.0) inside docker container on DIgitalOcean droplet.

CMD ["dramatiq-gevent", "tasks", "-p", "8", "-t", "60", "-v"]

After I start container, dramatiq starts processing messages. After 1-2 minutes, this appears in logs:

Stopping worker process...

Which means termhandler was called. I have 8 processes and term handler seems to be called only 7 times. And after ~12 minutes workers shut down

Channels and connections closed

And restarted

Running worker thread

This happened multiple times today. So, example timeline: 23:49:25 Started container 23:51:45 Got signal (Stopping worker process...) 00:04:31 Workers stopped (Channels and connections closed) 00:05:10 Workers restarted (Running worker thread)

I have no idea why this is happening :( I use dramatiq to perform really long tasks and they aren't idempotent, so this breaks all my code :(

UPD: continued my research, it seems one of workers exited unexpectedly

Worker with PID 13 exited unexpectedly (code -9). Shutting down...

OlegWock avatar Feb 12 '22 23:02 OlegWock

@tlmn-local did you have any luck with debugging/fixing this?

OlegWock avatar Feb 12 '22 23:02 OlegWock

@tlmn-local did you have any luck with debugging/fixing this?

I think it's a redis problem, when I try to change the broker to rabbitmq the problem disappears.

tlmn-local avatar Feb 13 '22 06:02 tlmn-local

Meh, I already use rabbitmq

OlegWock avatar Feb 13 '22 08:02 OlegWock

So, I figured this out. I'm feeling kinda dumb, but well, maybe this will be useful to someone who stumbles on this issue in future.

I checked kernel logs (/var/log/kern.log and /var/log/kern.log.1) and found out that worker was shut by OOMKiller. This did not make much sense for me, since workers didn't used even half of available 16GB. But then I remembered that in past these workers were deployed on smaller Droplet with 4GB of ram and I set 3GB memory limit in docker-compose file.

mem_limit: 3g

And that was the issue. Lifting up memory limit solved it

OlegWock avatar Feb 13 '22 09:02 OlegWock

This behavior is expected so things are working as intended. If workers die for any reason, the main process restarts everything.

Worker with PID 13 exited unexpectedly (code -9). Shutting down...

That error code indicates that the process was SIGKILLed (as you later found out, by the OOM killer).

Bogdanp avatar Sep 23 '22 08:09 Bogdanp