
arbiter: don't log if handling SIGCHLD

Open hydrargyrum opened this issue 10 months ago • 11 comments

Logging while handling a signal is bad practice. See https://docs.python.org/3/library/logging.html#thread-safety

Fixes https://github.com/benoitc/gunicorn/issues/2816

hydrargyrum avatar Sep 05 '23 10:09 hydrargyrum

Thanks, but this is done on purpose, as the comment says. It's not common. I prefer to keep it like this. This place is also threadsafe.

benoitc avatar Sep 05 '23 11:09 benoitc

@benoitc what is done on purpose? Raising exceptions and breaking the logging system? Please see https://github.com/benoitc/gunicorn/issues/2816, which many people encounter.

hydrargyrum avatar Sep 05 '23 12:09 hydrargyrum

This code should be threadsafe. I suspect the error happens earlier. We first need to understand why this code is re-entered here.

Let's keep it open since it actually addresses the cause. But I would insist that we need to understand why this is happening now.

benoitc avatar Sep 05 '23 12:09 benoitc

Linking to https://github.com/benoitc/gunicorn/issues/3050. I think we should revisit the change that introduced more logging.

@hydrargyrum if you happen to be around, I'm on IRC.

benoitc avatar Sep 05 '23 12:09 benoitc

Wanted to share a little bit of info I found, since I have encountered this bug as well. The issue isn't with thread-safety, but rather with re-entrancy within the logging module and the underlying BufferedWriter (from the io module).

The issue occurs (usually on a slow disk) when the SIGCHLD handler interrupts the main process while it is in the middle of writing its logging buffer. The handler (which runs on the same thread) will try to write to that buffer but can't, because the buffer doesn't allow reentrant calls, so it raises an exception.

If the logger is configured so that stderr goes to that same stream or file, this sets off a cascade of reentrant-call errors, eventually resulting in a recursion error.

It is a hard bug to reproduce, because the main thread has to interrupt itself exactly while it is writing the buffer.
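For illustration, here is a minimal, self-contained sketch of the same failure mode. It is not gunicorn code: it is Unix-only, uses SIGALRM instead of SIGCHLD, and, being timing-dependent, it may take a while (or a slow stderr, e.g. piped to a slow consumer) before it actually triggers.

```python
import signal
import sys

def on_alarm(signum, frame):
    # Runs in the main thread. If the timer fired while the loop below
    # was inside stderr's buffered writer, this write re-enters the same
    # buffer and raises "RuntimeError: reentrant call inside <_io.BufferedWriter ...>".
    sys.stderr.write("interrupted while writing\n")

signal.signal(signal.SIGALRM, on_alarm)
signal.setitimer(signal.ITIMER_REAL, 0.0005, 0.0005)  # fire very frequently

try:
    while True:
        # Long writes keep the main thread inside the buffered writer,
        # raising the odds that a signal lands mid-write.
        sys.stderr.write("x" * 65536 + "\n")
except RuntimeError as exc:
    signal.setitimer(signal.ITIMER_REAL, 0)  # stop the timer
    print(f"\nreproduced: {exc}")
```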

Ref: https://docs.python.org/3/library/io.html#reentrancy, https://github.com/python/cpython/blob/e24f9ae7035cfb4c11c4a5c2820121fc22463f8b/Modules/_io/bufferedio.c#L295C11-L295C11

ja0x avatar Oct 10 '23 14:10 ja0x

@ja0x is correct. Python programs MUST NOT print or log anything inside a signal handler. https://stackoverflow.com/q/45680378 is another link that explains the root cause of this problem.
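For contrast, the pattern the later fixes aim for is to only record facts inside the handler and do the logging later, from the main loop. The following is a rough, illustrative sketch under that assumption; it is not gunicorn's actual code (and not necessarily what PR #2908 does), and the names are made up.

```python
import collections
import logging
import os
import signal

log = logging.getLogger("arbiter")

# Facts recorded by the handler; logging happens later, outside the handler.
reaped = collections.deque()  # (pid, waitpid status) pairs

def handle_chld(signum, frame):
    # No logging, no locks: just reap children and remember the results.
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break
        if pid == 0:
            break
        reaped.append((pid, status))

signal.signal(signal.SIGCHLD, handle_chld)

def report_dead_workers():
    # Called from the main loop, never from a signal handler,
    # so it is safe to touch the logging machinery here.
    while reaped:
        pid, status = reaped.popleft()
        if os.WIFSIGNALED(status):
            log.warning("worker %s was killed by signal %s", pid, os.WTERMSIG(status))
        elif os.WIFEXITED(status) and os.WEXITSTATUS(status) != 0:
            log.error("worker %s exited with code %s", pid, os.WEXITSTATUS(status))
```

A real arbiter also has to wake its main loop when the handler runs (gunicorn's arbiter already does this through a self-pipe), which this sketch omits.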

This gunicorn bug has a long and sad history.

- Dec 2017: Issue #1663 was reported: In some cases (for example OOM) workers die without any error logged, or even crashloop.
- Dec 2019: Issue #2215 was reported: Essentially the same as #1663, but with a reproducible example, and focused on missing logging.
- Apr 2020: PR #2315 (ee1e5c1928c71fa366dfac99c65f630e03aed8b7) was created by @tilgovi. It logs an error in the SIGCHLD handler if a worker exits with a nonzero code, and adds more logging to arbiter halt (that part is okay).
- Sep 2020: PR #2419 (28df9926d7c650248d267eeafaa8b2c2cdd9af38, ca36d410fd86c82cb2c998a1dab5f7f967405fdd) was created by @mildebrandt trying to fix issue #1663. It logs an error in the SIGCHLD handler if a worker exits with a nonzero code or is killed by a signal.
- Dec 2020: PR #2475 (b695b497b9b8e7351808848ab9e342219601742a) was created by @aberres trying to fix issue #2215, and immediately merged by @benoitc. It logs a warning in the SIGCHLD handler if a worker is killed by a signal.
- Apr 2021: gunicorn 20.1.0 was released containing PR #2475. Soon, issue #2564 was reported: Users started experiencing crashes with RuntimeError: reentrant call.
- Oct 2021: PR #2667 was created by @BoasKluiving to revert PR #2475 in order to fix issue #2564.
- Dec 2021: PR #2475 was reverted by @benoitc in commit 76f8da24cbb992d168e01bda811452bcf3b8f5b3 in order to fix issue #2564.
- Jan 2022: PR #2667 was closed by @javabrett because he noticed that PR #2475 had been reverted in the meantime.
- Jun 2022: Issue #2816 was reported by @InfernalAzazel. AFAICT, it's actually just an exact duplicate of issue #2564, because the fix in commit 76f8da24cbb992d168e01bda811452bcf3b8f5b3 wasn't released.
- Dec 2022: PR #2908 was created by myself. It logs a warning if a worker is killed by a signal, but the log is written later in the main loop, not directly in the SIGCHLD handler. I did not receive any response in the PR.
- May 2023: PR #2315 and PR #2419 were merged by @benoitc, reintroducing the RuntimeError: reentrant call bug. Issue #2564 was closed (incorrectly).
- Jul 2023: gunicorn 21.0.0 was released containing PRs #2315 and #2419.
- Aug 2023: Issue #3050 was reported by @feliixx, who wants the log to be a warning, not an error.
- Sep 2023: PR #3064 (this PR) was created by @hydrargyrum, trying to fix issue #2816 by skipping logging in the SIGCHLD handler. (FWIW: you missed one self.log.error call, and nobody except handle_chld calls reap_workers anyway.)

If I may be frank, I feel very disappointed and frustrated by the gunicorn project. It is a cruel irony that my PR #2908 which attempts to handle this correctly was ignored, but PRs #2315 and #2419 were merged, causing a regression.

But I'll try to swallow my bitter feelings and offer constructive advice.

My advice to @benoitc is to do this:

  1. git revert -m 1 1fd1c82195 (revert both commits of #2419)
  2. git revert ee1e5c1928 (revert the bad half of #2315, keep the good half)
  3. Review and (if I may be so bold) merge my PR #2908
  4. Reject PR #3064 (this PR)

Doing that fixes #2564 properly (at least I think so), fixes #3050, doesn't regress #2215, and removes the need for PR #3064 (this PR).

Hope that helps.

TomiBelan avatar Oct 31 '23 22:10 TomiBelan

This is all extremely helpful and politely offered advice. I'll try to find time to push this along. I appreciate all your effort here and the exceptional patience in helping to sort this out.

Most of the maintainers are volunteers and we don't always find the sustained time to reconstruct all of this context, so gathering it in one place is immensely helpful.

tilgovi avatar Oct 31 '23 23:10 tilgovi

@TomiBelan Thank you for the write-up. Since my failed approach three years ago, I have been silently hoping to get the issue fixed for good :)

aberres avatar Nov 01 '23 06:11 aberres

I've rebased #2314, which moves reaping to the main thread. That should move the logging out of the signal handler. Any reviews or testing of that would be appreciated. I've assigned myself the related issues to follow up on any other changes afterwards.

tilgovi avatar Dec 28 '23 00:12 tilgovi

Also bitten by this. I debugged and googled my way to this PR. I have nothing to add to the excellent analyses by @ja0x and @TomiBelan, so just a +1. I experience the same issue with reentrant logging inside Arbiter.reap_workers() during SIGCHLD: https://github.com/benoitc/gunicorn/blob/master/gunicorn/arbiter.py#L242

To make matters worse, in my case gunicorn is auto-launched and monitored from supervisord, running a Flask app. After the unhandled exception due to RuntimeError: reentrant call inside <_io.BufferedWriter name='<stderr>'>, the server port was not released correctly, and the newly restarted gunicorn arbiter process (launched by supervisor) keeps printing this forever:

[2024-01-25 06:50:00 +0000] [384109] [INFO] Starting gunicorn 21.2.0
[2024-01-25 06:50:00 +0000] [384109] [ERROR] Connection in use: ('127.0.0.1', 8000)
[2024-01-25 06:50:00 +0000] [384109] [ERROR] Retrying in 1 second.
[2024-01-25 06:50:01 +0000] [384109] [ERROR] Connection in use: ('127.0.0.1', 8000)
[2024-01-25 06:50:01 +0000] [384109] [ERROR] Retrying in 1 second.
…

I.e. the server is bricked after the [ERROR] Unhandled exception in main loop, somehow.

On a positive note, I've been using gunicorn for years and this is the first major issue. Kudos and huge thanks to the maintainers!

piskvorky avatar Feb 02 '24 14:02 piskvorky

Without really checking what PRs were opened already, I posted a PR yesterday (#3148) with an attempt at solving various signal handling issues in gunicorn that I had the misfortune to encounter. I couldn't see that any of the open proposals used the same approach to solve the problem, so I thought people could have a look before I eventually close it.

sylt avatar Feb 02 '24 15:02 sylt