Orphan workers after `restart` with a different `n` value

Open dan-jensen opened this issue 5 years ago • 1 comment

PROBLEM Delayed Job essentially orphans workers when you change the worker count from 1 to 2+, or vice versa, and restart. By that I mean Delayed Job allows certain workers to continue running through stop, start and restart commands. The exact reasons are explained below, but the fundamental problem is that the delayed_job executable sometimes behaves like a service and other times like a worker management tool.

SIGNIFICANCE This problem probably has moderate significance. Worker counts are not changed frequently. However, restart is the most common way of administering workers in production environments. More importantly, orphan workers can eat up memory, slow down servers and even cause insufficient memory failures. So while it might not happen frequently, it can happen easily in production environments and be serious when it does.

REPLICATION INSTRUCTIONS FOR MULTIPLE ORPHANS (DECREASE WORKER COUNT)

1. Execute `script/delayed_job -n 5 restart`
2. Observe there are 5 Delayed Job processes running, as expected
3. Execute `script/delayed_job -n 1 start`
4. Observe there are 6 Delayed Job workers running, when expecting 1; 5 orphan workers

REPLICATION INSTRUCTIONS FOR ONE ORPHAN (INCREASE WORKER COUNT)

1. Execute `script/delayed_job -n 1 start`
2. Observe there is 1 Delayed Job process running, as expected
3. Execute `script/delayed_job -n 5 restart`
4. Observe there are 6 Delayed Job processes running, when expecting 5; 1 orphan worker

ANALYSIS The two examples above actually have different causes:

Increasing worker count causes an orphan worker due to the different naming convention for PID files when there is a single worker versus multiple (which has not changed since its introduction). But eliminating one naming convention (the single worker convention) would make orphans of all workers that currently exist under that convention. That would make this orphan worker problem worse, rather than better.
Decreasing worker count causes orphan worker(s) because Delayed Job only issues stop commands (via the Daemons gem) for the first n workers specified. Workers n+1 and beyond are allowed to continue running through future commands and are therefore orphaned.
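
To make the two causes concrete, here is a minimal Ruby sketch of the PID file bookkeeping. It is not Delayed Job's code: `pid_files_for`, `orphaned_pid_files` and `running_after` are hypothetical helpers, and the file names (`delayed_job.pid` for one worker, `delayed_job.<index>.pid` for several) are my reading of the two naming conventions, so treat them as an assumption.

```ruby
# Hypothetical helper: PID files that a command run with `-n count` addresses.
# Assumed convention: one worker => delayed_job.pid,
# several workers => delayed_job.0.pid, delayed_job.1.pid, ...
def pid_files_for(count)
  return ['delayed_job.pid'] if count == 1

  Array.new(count) { |index| "delayed_job.#{index}.pid" }
end

# PID files (and therefore workers) that a stop/restart with new_count
# never looks at, given that old_count workers are currently running.
def orphaned_pid_files(old_count, new_count)
  pid_files_for(old_count) - pid_files_for(new_count)
end

# Total processes alive after restarting with new_count:
# the untouched orphans plus the freshly started workers.
def running_after(old_count, new_count)
  orphaned_pid_files(old_count, new_count).size + new_count
end
```

Under these assumptions, going from 1 to 5 never touches `delayed_job.pid` (1 orphan, 6 processes), and going from 5 to 1 never touches any of the five indexed files (5 orphans, 6 processes), which matches both sets of replication instructions. Going from 5 to 2 would leave the workers with index 2 through 4 running.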

PROPOSED SOLUTION A solution to both causes: improve worker termination to avoid orphans. Specifically, restart should stop ALL workers (without regard for the n argument value) before starting n workers.
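
One way to stop all workers without regard for n is to discover them from the PID directory instead of deriving their names from the argument. The sketch below only illustrates that idea and is not the implementation in PR #1090. `all_worker_pid_files`, `stop_all_workers`, the `killer` parameter and the file name patterns are assumptions made for the example.

```ruby
# Hypothetical helper: every worker PID file in pid_dir, covering both the
# single-worker and the multi-worker naming convention (assumed patterns).
def all_worker_pid_files(pid_dir)
  patterns = ['delayed_job.pid', 'delayed_job.*.pid'].map do |name|
    File.join(pid_dir, name)
  end
  Dir.glob(patterns).sort
end

# Signal every worker found on disk, ignoring the requested worker count.
# `killer` is injectable so the logic can be exercised without real processes.
# Returns the PIDs that were found.
def stop_all_workers(pid_dir, killer: ->(pid) { Process.kill('TERM', pid) })
  pids = all_worker_pid_files(pid_dir).map do |path|
    pid = File.read(path).to_i
    next unless pid.positive? # skip empty or corrupt PID files

    begin
      killer.call(pid)
    rescue Errno::ESRCH
      # The process is already gone, so the PID file was stale.
    end
    pid
  end
  pids.compact
end
```

A real implementation would also need to wait for the workers to exit before starting n new ones. The sketch only covers finding and signalling them.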

From a broader perspective, the proposed solution is that the delayed_job executable should NOT be like a worker management tool for starting and stopping arbitrary numbers of workers. It SHOULD be like a service that starts and stops all workers. That is how it makes the most intuitive sense. Plus, that is already how it works in the most basic context: start. Without any Delayed Job workers running, execute bin/delayed_job start twice. On the second invocation you encounter: "ERROR: there is already one or more instance(s) of the program running". (That error actually comes from the Daemons gem, but that's irrelevant.) This is the expected behavior of a service: if it is already running you can't start it again. If the delayed_job executable were a worker management tool then the second invocation would simply result in one more worker being started. Further, if this were a worker management tool, it should have a command to tell you how many workers are currently running, so you know how many to stop or restart, and it doesn't. This executable is clearly intended to behave like a service, but doesn't in the restart context (and others; see "FOLLOW-ON IMPROVEMENTS" below).

Please see PR #1090 for a proposed solution to the described issue with restart, which will also lay the foundation for follow-on improvements.

FOLLOW-ON IMPROVEMENTS In addition to resolving the behavior of restart, these changes would also be required if the delayed_job executable is going to behave like a service:

start should raise an exception when any number of workers is running, not just the same number that was previously started (start then start raises an exception, but start then start -n 2 does not, again because of the different PID file naming conventions)
stop should stop all workers, not just the number specified by n. This is likely the fix for Issue #212.
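
Taken together with the restart change, the service semantics described in this issue can be modeled in a few lines. This is an in-memory sketch only: `WorkerService` and `AlreadyRunning` are hypothetical names, and nothing here touches PID files or the Daemons gem.

```ruby
# In-memory model of "service" semantics; not Delayed Job's actual code.
class WorkerService
  AlreadyRunning = Class.new(StandardError)

  attr_reader :running

  def initialize
    @running = []
  end

  # start refuses to run if ANY worker is alive, whatever count is requested.
  def start(count = 1)
    unless @running.empty?
      raise AlreadyRunning, "#{@running.size} worker(s) already running"
    end

    @running = Array.new(count) { |index| "worker-#{index}" }
  end

  # stop takes no count: it always stops every worker and returns them.
  def stop
    stopped = @running
    @running = []
    stopped
  end

  # restart stops ALL workers first, then starts exactly `count`.
  def restart(count = 1)
    stop
    start(count)
  end
end
```

With this model, `start` followed by `start(2)` raises, and `restart(5)` followed by `restart(1)` leaves exactly one worker, so neither replication scenario can produce orphans.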

Jun 02 '19 13:06 dan-jensen

In another issue the culprit was orphaned DelayedJob processes that needed to be manually killed from the command line. The proposal in this issue would solve that problem more permanently.

Apr 09 '24 18:04 dan-jensen
