delayed_job
Orphan workers after `restart` with a different `n` value
PROBLEM
Delayed Job essentially orphans workers when you change the worker count from 1 to 2+, or vice versa, and `restart`. By that I mean Delayed Job allows certain workers to continue running through `stop`, `start` and `restart` commands. The exact reasons are explained below, but the fundamental problem is that the `delayed_job` executable sometimes behaves like a service and other times like a worker management tool.
SIGNIFICANCE
This problem probably has moderate significance. Worker counts are not changed frequently. However, `restart` is the most common way of administering workers in production environments. More importantly, orphan workers can eat up memory, slow down servers and even cause out-of-memory failures. So while it might not happen frequently, it can happen easily in production environments and be serious when it does.
REPLICATION INSTRUCTIONS FOR MULTIPLE ORPHANS (DECREASE WORKER COUNT)
- Execute `script/delayed_job -n 5 restart`
- Observe there are 5 Delayed Job processes running, as expected
- Execute `script/delayed_job -n 1 start`
- Observe there are 6 Delayed Job processes running, when expecting 1; 5 orphan workers
REPLICATION INSTRUCTIONS FOR ONE ORPHAN (INCREASE WORKER COUNT)
- Execute `script/delayed_job -n 1 start`
- Observe there is 1 Delayed Job process running, as expected
- Execute `script/delayed_job -n 5 restart`
- Observe there are 6 Delayed Job processes running, when expecting 5; 1 orphan worker
ANALYSIS
The two examples above actually have different causes:
- Increasing worker count causes an orphan worker due to the different naming convention for PID files when there is a single worker versus multiple (which has not changed since its introduction). But eliminating one naming convention (the single worker convention) would make orphans of all workers that currently exist under that convention. That would make this orphan worker problem worse, rather than better.
- Decreasing worker count causes orphan worker(s) because Delayed Job only issues `stop` commands (via the Daemons gem) for the first `n` workers specified. Workers `n+1` and beyond are allowed to continue running through future commands and are therefore orphaned.
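
The two causes can be illustrated with a small model. This is only a sketch, not Delayed Job's actual code: the helper names are hypothetical, and the PID file names (`delayed_job.pid` for a single worker, `delayed_job.<index>.pid` for multiple workers) are an assumption used to make the naming difference concrete.

```ruby
# Hypothetical model of which PID files a command run with worker count n targets.
# Assumption: a single worker uses "delayed_job.pid", while multiple workers use
# "delayed_job.<index>.pid" with indices starting at 0.
def targeted_pid_files(n)
  return ["delayed_job.pid"] if n == 1

  (0...n).map { |i| "delayed_job.#{i}.pid" }
end

# Simulate `restart` with worker count n against the currently running workers.
# Only the targeted workers are stopped; everything else keeps running.
def simulate_restart(running, n)
  targets = targeted_pid_files(n)
  orphans = running - targets # workers that never receive a stop
  { orphans: orphans, running: orphans + targets }
end

# Decrease from 5 to 1: none of the 5 indexed workers is targeted.
decrease = simulate_restart(targeted_pid_files(5), 1)
puts decrease[:orphans].size # 5 orphans
puts decrease[:running].size # 6 processes in total

# Increase from 1 to 5: the single-convention worker is never targeted.
increase = simulate_restart(targeted_pid_files(1), 5)
puts increase[:orphans].inspect # ["delayed_job.pid"]
```

Under the same assumptions, going from 5 workers to 2 leaves the workers with indices 2, 3 and 4 untouched, which matches the "worker `n+1` and beyond" description.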
PROPOSED SOLUTION
A solution to both causes: improve worker termination to avoid orphans. Specifically, `restart` should stop ALL workers (without regard for the `n` argument value) before starting `n` workers.
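
A minimal sketch of the "stop ALL workers" idea follows. It is not the code from the PR; the function names, the `delayed_job*.pid` glob pattern and the injectable `signaller` parameter are all assumptions made for illustration. The point is that workers are discovered from the PID directory rather than derived from `n`.

```ruby
# Collect every worker PID file in pid_dir, whichever naming convention created it.
# Assumption: all worker PID files match "delayed_job*.pid".
def all_worker_pid_files(pid_dir)
  Dir.glob(File.join(pid_dir, "delayed_job*.pid")).sort
end

# Signal every worker found, ignoring the requested worker count entirely.
# `signaller` is injectable so the sketch can be exercised without real workers.
def stop_all_workers(pid_dir, signaller: ->(pid) { Process.kill("TERM", pid) })
  all_worker_pid_files(pid_dir).map do |path|
    pid = Integer(File.read(path).strip)
    begin
      signaller.call(pid)
    rescue Errno::ESRCH
      # Stale PID file: the process has already exited, nothing to stop.
    end
    pid
  end
end
```

A `restart` built this way would call something like `stop_all_workers` first and only then start `n` new workers, so the previous worker count no longer matters.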
From a broader perspective, the proposed solution is that the `delayed_job` executable should NOT be like a worker management tool for starting and stopping arbitrary numbers of workers. It SHOULD be like a service that starts and stops all workers. That is how it makes the most intuitive sense. Plus, that is already how it works in the most basic context: `start`. Without any Delayed Job workers running, execute `bin/delayed_job start` twice. On the second invocation you encounter: "ERROR: there is already one or more instance(s) of the program running". (That error actually comes from the Daemons gem, but that's irrelevant.) This is the expected behavior of a service: if it is already running, you can't start it again. If the `delayed_job` executable were a worker management tool, then the second invocation would simply result in one more worker being started. Further, if this were a worker management tool, it should have a command to tell you how many workers are currently running, so you know how many to stop or restart, and it doesn't. This executable is clearly intended to behave like a service, but doesn't in the `restart` context (and others; see "FOLLOW-ON IMPROVEMENTS" below).
Please see PR #1090 for a proposed solution to the described issue with `restart`, which will also lay the foundation for follow-on improvements.
FOLLOW-ON IMPROVEMENTS
In addition to resolving the behavior of `restart`, these changes would also be required if the `delayed_job` executable is going to behave like a service:
- `start` should raise an exception when any number of workers is running, not just the same number that was previously started (`start` then `start` raises an exception, but `start` then `start -n 2` does not, again because of the different PID file naming conventions)
- `stop` should stop all workers, not just the number specified by `n`. This is likely the fix for Issue #212.
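
The stricter `start` check could look roughly like the sketch below. The function names and the `delayed_job*.pid` glob pattern are hypothetical, not part of Delayed Job or the Daemons gem. The sketch treats a PID file as "running" only if its process still exists, so stale PID files do not block a start.

```ruby
# True if a process with this PID exists. Signal 0 performs the existence and
# permission checks without actually delivering a signal.
def process_alive?(pid)
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false # no such process
rescue Errno::EPERM
  true # process exists but belongs to another user
end

# PIDs of all live workers, under either PID file naming convention.
# Assumption: all worker PID files match "delayed_job*.pid".
def running_worker_pids(pid_dir)
  Dir.glob(File.join(pid_dir, "delayed_job*.pid"))
     .map { |path| Integer(File.read(path).strip) }
     .select { |pid| process_alive?(pid) }
end

# Guard for `start`: refuse to start when any worker is running,
# regardless of how many workers were requested.
def ensure_no_workers_running!(pid_dir)
  pids = running_worker_pids(pid_dir)
  return if pids.empty?

  raise "#{pids.size} worker(s) already running (PIDs: #{pids.join(', ')}); stop them first"
end
```

`running_worker_pids(pid_dir).size` would also give the "how many workers are currently running" answer that the executable currently lacks.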
In another issue the culprit was orphaned DelayedJob processes that needed to be manually killed from the command line. The proposal in this issue would solve that problem more permanently.