gunicorn icon indicating copy to clipboard operation
gunicorn copied to clipboard

Error instead of warning in logs when sending SIGTERM since v21

Open feliixx opened this issue 2 years ago • 10 comments
trafficstars

Hi @benoitc,

Thanks a lot for all the work you've done on Gunicorn!

We recently upgraded from version 20.1.0 to 21.2.0 and noticed in our logs when restarting our application that a warning has been replaced by an error when sending SIGTERM to a worker:

version 20.1.0:

systemd: Stopping App...
app: gunicorn WARNING Worker with pid 2542594 was terminated due to signal 15
app: gunicorn INFO Handling signal: term
app: gunicorn INFO Worker exiting (pid: 2542537)
app: gunicorn INFO Worker exiting (pid: 2542545)
app: gunicorn INFO Shutting down: Master
systemd: Stopped App.
systemd: Started App.

version 21.2.0:

systemd: Stopping App...
app: gunicorn ERROR Worker (pid:1340527) was sent SIGTERM!
app: gunicorn INFO Handling signal: term
app: gunicorn INFO Worker exiting (pid: 1340420)
app: gunicorn INFO Worker exiting (pid: 1340424)
app: gunicorn INFO Shutting down: Master
systemd: Stopped App.
systemd: Started App.

This is a bit annoying for us as we carefully monitor any error sent by our app, and this seems to be a false positive.

The commits introducing this change are 76f8da2 and 28df992, however it's not clear from the commit messages why the log level was changed. The documentation about signals does say that sending SIGTERM is the proper way to do a graceful shutdown.

Is it possible to change the log level back to warning, at least for SIGTERM? We'll be happy to send a PR if you agree.

feliixx avatar Aug 07 '23 13:08 feliixx

Hello, any update on this ?

Venture200 avatar Aug 28 '24 10:08 Venture200

Hello, also effected by this issue. Any update on this ?

pauljeannot avatar Oct 01 '24 13:10 pauljeannot

For other people affected by this: we've managed to sidestep the issue by using KillMode=mixed in our systemd service config:

[Service]
KillMode=mixed
...

We've been using this config in production for over a year with Gunicorn 21.2.0 and above without issues.

feliixx avatar Oct 01 '24 14:10 feliixx

For my understanding, using KillMode=mixed means that while the master process receives SIGTERM for a graceful shutdown, the workers are immediately sent a SIGKILL, which prevents them from completing any pending tasks or writing logs (so no false positive in your monitoring).

Am I right?

pauljeannot avatar Oct 01 '24 15:10 pauljeannot

For my understanding, using KillMode=mixed means that while the master process receives SIGTERM for a graceful shutdown, the workers are immediately sent a SIGKILL, which prevents them from completing any pending tasks or writing logs

Exactly. As you noted, that's not ideal as the worker won't shutdown gracefully but instead are immediately killed, so it might be problematic in some cases.

feliixx avatar Oct 02 '24 06:10 feliixx

the log level is indeed wrong. This will be fixed in coming release

benoitc avatar Oct 02 '24 06:10 benoitc

@benoitc Thanks! I just opened a PR for it, if it can help.

pauljeannot avatar Oct 04 '24 21:10 pauljeannot

I'm gonna go against the stream here, but I honestly don't think the log level is wrong. Instead, I think we should ask: Why do we even hit this trace/code path from the beginning? If SIGTERM is caught by the worker as a mean of doing a graceful shutdown (meaning the worker should exit normally with exit code 0), why are we executing a code path for handling workers terminated by a signal? I would argue that if the worker exited normally, we shouldn't even be in that code path.

I've tried to address what I consider to be the root cause in #3312. It's more intricate than changing the log level of course, but I think it's more correct. The benefit of keeping the log level as is would be that other signals (such as e.g. SIGSEGV and SIGKILL) are logged as errors.

sylt avatar Oct 05 '24 10:10 sylt

@sylt I disagree; sorry. We don't use Gunicorn's settings to automatically restart workers; instead, we do so manually (well automatically) based on a number of factors that we monitor ourselves (e.g., pod movements, memory spikes, deploys, etc.). We first gracefully move new traffic away to other Gunicorn pods and then manually send SIGTERM to the master process. This results in us getting these errors in the logs that we don't want as there nothing wrong and hasn't been any error.

For more context on why, if we let Gunicorn restart workers as it pleases (e.g. based on max_requests), this causes 502 errors to be returned by Caddy (which sits in front of it) for a very brief period, as there are no workers available for the new requests. Sure, we can let Caddy auto-retry, which we do for GET requests, but we cannot let retries happen blindly for requests like PUT/POST as Caddy doesn't know why there was no Gunicorn available and so we might process the same request multiple times.

I hope this makes sense and shows why for us this is a problem.

violuke avatar Oct 05 '24 10:10 violuke

@violuke No problem :) I don't think we're really disagreeing, I'm just questioning the implementation of how to prevent the error from happening! Because I totally agree that you shouldn't be seeing an error in the scenario you describe. My main point here is that I'm wondering if we should even be hitting the code path that produces the error from the very beginning if the worker exited normally. It is with that background I didn't think lowering the log level was the obvious solution from an implementation point of view.

In the PR I posted, what I had in mind fixing was the SIGHUP case (when the arbiter restarts all of its workers), as that is where I noticed these errors myself. This is what I thought systemd was doing, but this is most likely where I was mistaken, no? In your case, when you're sending SIGTERM manually yourself only to the main process, I honestly can't figure out why the error would occur. Only if all child processes however were sent a SIGTERM at the same time, I would have understood it though, as it's easy to reproduce. In that case though, it'd argue it's user fault.

With all that said, lowering the the log level is perhaps the easiest/best way to solve the issue for now, I just think it should be done for the right reasons.

sylt avatar Oct 05 '24 12:10 sylt

I'm using gunicorn (version 23.0.0) and still have the log at error instead of warning .

Was something preventing this issue from being fixed?

camilonova avatar Sep 20 '25 16:09 camilonova

Seeing quite a bit of noise from these error logs, it would be nice to have a definitive resolution on the log level to allow practitioners to adapt their codebases accordingly.

purificant avatar Oct 01 '25 12:10 purificant