Worker timeout needs to be set too long to account for worker startup
(Note: I'm not asking anyone to build a solution for this; I want to gauge whether the proposed approach would be acceptable as a PR that I would create.)
I've run into a situation where my worker startup takes a fairly long time, and I need to increase the worker timeout to cover that. But once the worker is up, that timeout is far too long to catch a problem, especially with gevent workers (where the timeout is not tied to request times). For example, my workers take 2-3 minutes to start up normally (with some outliers), but I really only want to have the timeout at 3-5 seconds so a worker failure is detected and handled quickly enough to not cause serious impact.
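For concreteness, this is roughly what the config tension looks like today in a standard gunicorn config file (the worker count is arbitrary, and the numbers are just the ones from my case):

```python
# gunicorn.conf.py - the current situation, not a proposal
worker_class = "gevent"
workers = 4

# Has to be sized for the slowest worker startup (2-3 minutes, plus outliers),
# even though once a worker is serving, 3-5 seconds is all I actually want
# before a dead worker gets reaped.
timeout = 300
```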
For a variety of reasons (a complex code base, preload not being a workable solution for us, and not being able to get below 10 seconds even then), shortening worker startup isn't an option. What I would like to have is a configurable grace period for the timeout. So if I set the grace period to 5 minutes, this is what would happen:
1. arbiter starts a new worker. The time of the start is tracked
2. worker starts up - this will take 2 minutes to complete before the timeout loop is active
3. arbiter checks for the worker checking in, which it is not. Since the current time is earlier than start time + grace period, it ignores this
4. repeat step 3 for 2 minutes
5. worker completes startup and starts the timeout checkin loop
6. arbiter checks for the worker checking in, which it now sees at start time + 2 minutes. It clears the grace period
7. arbiter now expects to see the worker check in at least once every timeout seconds, or it will reap the worker
If for some reason the worker did not start properly and never got to the timeout checkin loop, the arbiter would see this at "start time + grace period" and follow the normal process for reaping the worker and restarting it.
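To make that concrete, here's a minimal sketch of the check I have in mind, written as a standalone function rather than actual gunicorn code. The `spawn_time` and `last_checkin` attributes and the `grace_period` setting are hypothetical names used only for illustration:

```python
import time

def should_reap(worker, timeout, grace_period):
    """Sketch of the arbiter-side decision, not actual gunicorn code.

    worker.spawn_time   - hypothetical: when the arbiter forked this worker
    worker.last_checkin - hypothetical: last time the worker checked in
    grace_period        - hypothetical new setting, e.g. 300 seconds
    """
    now = time.time()

    # Worker has never checked in yet: ignore it until the grace period is
    # up, then reap it through the normal path.
    if worker.last_checkin <= worker.spawn_time:
        return now - worker.spawn_time > grace_period

    # Worker has checked in at least once: the grace period no longer
    # applies, and the normal (short) timeout runs from its last check-in.
    return now - worker.last_checkin > timeout
```

With timeout = 5 and grace_period = 300, a worker that never finishes booting is reaped at roughly start + 5 minutes, while a healthy worker that later wedges is reaped within about 5 seconds.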
If I've missed a way to do this already, I'd appreciate a pointer. Otherwise, if this sounds like a reasonable approach, I'm happy to create a PR for it for further review and refinement.
This is why preload is there. You can preload common code and the workers will share it. Wouldn't it work for your case?
We can certainly cut down our load time by making better use of preload - right now it just doesn't work at all for us, and we'll need to refactor a number of things to use it (that's our problem, not a justification for a change to the open source project). Loading and processing configs, for example, can easily be done there. But there will still be a lot of work that can't be done in preload - instantiating metrics, and setting up downstream clients and connections.
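To be concrete about that split, here's roughly what I mean, using the standard gunicorn config-file settings and hooks (nothing new here, just illustrating which work moves where):

```python
# gunicorn.conf.py (sketch of the split we'd be aiming for)

# With preload enabled, the WSGI app module is imported once in the master,
# so anything done at module import time (config parsing, loading static
# data) is paid once and shared by the forked workers.
preload_app = True

# Per-worker setup that can't survive a fork (metrics registries, downstream
# clients, open connections) still has to happen after the fork, so it still
# counts against the worker timeout today.
def post_fork(server, worker):
    server.log.info("worker %s: running per-worker initialization", worker.pid)
```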
The problem is that this still tightly couples "how fast a worker starts up" to "how quickly we can respond to a problem". I can probably get the post-fork worker startup down to 30 seconds, but that's still far longer than my response time to a deadlocked worker should be.
I took a look at the change last night after I posted this, and it ended up being a lot more trivial than I expected. Outside of setting up the config parameter itself, it's 2-3 lines of code in WorkerTmp. The only reason I haven't put in a PR yet is that I wanted to add tests for that class first.
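To give an idea of the shape (this is not the actual diff, just a simplified stand-in with a made-up setting name): WorkerTmp already owns notify() and last_update(), so the idea is that it only needs to know when it was created and how long the grace period is.

```python
import time

class WorkerTmpSketch:
    """Simplified stand-in for gunicorn's WorkerTmp, just to show the idea.

    The real class tracks liveness via the ctime of a temp file that the
    worker touches from notify(); the arbiter compares last_update() against
    the timeout. The grace_period name here is hypothetical, not a real
    gunicorn setting.
    """

    def __init__(self, grace_period=0.0):
        self.spawned = time.time()        # when this worker was created
        self.grace_period = grace_period  # hypothetical new setting
        self._notified = False
        self._last_notify = self.spawned

    def notify(self):
        # In gunicorn this toggles the temp file's mode so its ctime advances;
        # here we just record the time of the check-in.
        self._notified = True
        self._last_notify = time.time()

    def last_update(self):
        now = time.time()
        # Before the first notify(), report a fresh check-in, but only until
        # the grace period runs out. After that, the arbiter's usual
        # "now - last_update > timeout" check reaps the worker as before.
        if not self._notified and (now - self.spawned) < self.grace_period:
            return now
        return self._last_notify
```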
I’m experiencing the same problem in a production environment.
In my case, some Gunicorn workers take a long time to start because the application (a large monolith) performs heavy initialization steps, and even using --preload doesn’t fully help.
When a worker restarts, Kubernetes sometimes marks the service as unavailable because the liveness/readiness probes fail during this bootstrap period. A dedicated “startup grace period” would help prevent these false negatives and allow workers to finish booting properly before being evaluated as unhealthy.
I’ve been testing the proposed PR.
Thanks! @toddpalino