gunicorn Reap workers in the main loop

trafficstars

Handle SIGCHLD like every other signal, waking up the arbiter and handling the signal in the main loop rather than in the signal handler.

Take special care to reinstall the signal handler in case Python avoided doing so to prevent infinite recursion.

Clean up workers and call the worker_exit hook in only one place. When killing a worker, do not clean it up. The arbiter will now clean up the worker and invoke the hook when it reaps the worker.

With reaping handled in the main loop and kill_worker delegating responsibility for cleanup to the reaping loop, iterate over the workers dictionary everywhere else without concern for concurrent modification.

Apr 20 '20 23:04 tilgovi

This is not safe as is due to this code that might cause Gunicorn to miss signals:

    def signal(self, sig, frame):
        if len(self.SIG_QUEUE) < 5:
            self.SIG_QUEUE.append(sig)
            self.wakeup()

Apr 20 '20 23:04 tilgovi

Ahh, I think I may have identified why we had SIGCHLD separately handled?

Is it because some systems exhibit a behavior where not calling os.waitpid in the signal execution context causes infinite recursion?

http://poincare.matf.bg.ac.rs/~ivana/courses/ps/sistemi_knjige/pomocno/apue/APUE/0201433079/ch10lev1sec7.html

Apr 21 '20 01:04 tilgovi

Python actually checks for the situation I linked to, taking care not to re-install the signal handler for SIGCHLD automatically if it might cause infinite recursion.

https://github.com/python/cpython/blob/62183b8d6d49e59c6a98bbdaa65b7ea1415abb7f/Modules/signalmodule.c#L334

I've modified the first commit to address that and commented the code to indicate this case.

Apr 21 '20 01:04 tilgovi

Can you clarify what is it trying to solve? Do we have a way to reproduce the issue?

Just code cleanup. It avoids the need to call list() on the workers list when we loop over it because it removes the concurrent modification, by removing duplicate code.

Aug 27 '20 01:08 tilgovi

I reworked this with a separate indicator flag that the arbiter uses to mark that it needs to reap workers in the next main loop iteration. This change ensures that the arbiter will always reap workers even if the signal queue is full. I also added better explanation for why it is good to re-install the handler.

Please take another look.

Sep 13 '20 20:09 tilgovi

The use of use of list(self.WORKERS.items() was mainly there because .items() was not considered safe. When a worker die the list is updated concurrently.

Anyway the patch looks fine. I need to test it on BSDs systems, in particularly OpenBSD where I remember the special handling of the CHLD signal was added for.

Can it wait next release (let's make it next month?) ?

Sep 17 '20 15:09 benoitc

The use of use of list(self.WORKERS.items() was mainly there because .items() was not considered safe. When a worker die the list is updated concurrently.

Not any more because that's what this patch changes!

Anyway the patch looks fine. I need to test it on BSDs systems, in particularly OpenBSD where I remember the special handling of the CHLD signal was added for.

I don't think any BSD has this problem. From what I read it's mos likely some versions of Solaris.1 In any case, setting the handler when it is already set is not a problem, so I'm not worried about regression here. The only possible change here would to fix behavior on a system where Gunicorn was previously broken.

Can it wait next release (let's make it next month?) ?

Absolutely.

Sep 19 '20 03:09 tilgovi

@benoitc would you take another look here, please? I think it's a good idea to reap on the main thread. I think this approach is easier to understand, safer, reduces duplicate code, and eliminates concurrent modification of the workers list.

Dec 27 '23 22:12 tilgovi

@tilgovi I will check this patch. But one of the reason we handled CHLD differently is that montoring is separate from handling signals targetting the process itself. Having all in in the same signal loop means that this signal wil be in the same queue.

Feb 02 '24 18:02 benoitc

I wasn't aware of this pr until now, but FYI, I've posted #3148 which also aims to handle SIGCHLD on the main thread, although in a slightly different way.

Feb 02 '24 20:02 sylt

I've been running & testing #3148 (@sylt's PR), so far without problems. It seems more complete – perhaps the two PRs could be combined, gaining the benefit of both?

And then there's https://github.com/benoitc/gunicorn/pull/2908 too of course.

Feb 02 '24 20:02 piskvorky

@tilgovi I will check this patch. But one of the reason we handled CHLD differently is that montoring is separate from handling signals targetting the process itself. Having all in in the same signal loop means that this signal wil be in the same queue.

We could put the SIGCHLD signal on the front of the queue, if what you want is just to handle it with higher priority, but I think it's a good idea to make the handler itself short and defer reaping to the main thread.

I also had a version of this with just a boolean flag for whether to reap on wakeup, if for some reason you really don't want SIGCHLD in the queue at all. But I don't think I understand in what way this signal is for "monitoring" or different from "signals targeting the process itself."

Can you explain anymore what concerns you?

Feb 03 '24 16:02 tilgovi

I see on the other PR that it seems like one hesitation you have is that you don't want to wait for children when exiting Gunicorn, but that already happens. While we reap children in the signal context, handling other signals is blocked.

Feb 03 '24 16:02 tilgovi

gunicorn gunicorn copied to clipboard

Reap workers in the main loop

gunicorn
gunicorn copied to clipboard