gunicorn

provide a way to log a stack trace upon abnormal termination

Open adk-swisstopo opened this issue 7 months ago • 3 comments

For troubleshooting purposes, it is often useful to log a stack trace when a process is stuck in an abnormal state. AFAICT, with gunicorn's asynchronous workers there is no good way to do that. In https://github.com/geoadmin/service-stac/pull/558 we implemented our own worker class that overrides handle_exit and dumps the stacks there. However, that means we get stack traces for graceful terminations too. It would be nice if we could have a way to do that only when the worker exceeds the graceful_timeout upon termination.
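For readers who want to see the shape of that workaround, here is a minimal sketch using only the standard library. It is not the code from the linked PR: StackDumpOnExitMixin and format_thread_stacks are hypothetical names, and the only gunicorn behaviour assumed is that worker classes expose handle_exit(sig, frame).

```python
import sys
import traceback


def format_thread_stacks():
    """Return the current stack of every Python thread as one string."""
    chunks = []
    for thread_id, frame in sys._current_frames().items():
        stack = "".join(traceback.format_stack(frame))
        chunks.append(f"Thread {thread_id}:\n{stack}")
    return "\n".join(chunks)


class StackDumpOnExitMixin:
    """Hypothetical mixin, meant to be listed before a worker class in the MRO."""

    def handle_exit(self, sig, frame):
        # Dump first, then let the real worker class handle the signal.
        sys.stderr.write(format_thread_stacks())
        super().handle_exit(sig, frame)
```

Combined with a worker class, this dumps on every SIGTERM, which is exactly the drawback described above. Note also that sys._current_frames only reports the frame each OS thread is currently running, so suspended greenlets in a gevent worker would not appear in this output.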

One way to resolve this:

  • SIGTERM triggers graceful termination
  • wait graceful_timeout
  • (new) if the worker has not exited, trigger a quick shutdown (SIGINT/SIGQUIT) (handle_quit can be implemented to dump the stacks)
  • (new) wait quick_shutdown_timeout (new setting)
  • SIGKILL

The way Arbiter.stop is currently implemented, a worker only ever receives SIGTERM or SIGQUIT, but never both.
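The proposed sequence could be sketched roughly as follows. This is not gunicorn's Arbiter code: stop_worker, send_signal, is_alive and poll_interval are hypothetical names chosen for illustration, and quick_shutdown_timeout is the new setting suggested above, not an existing option.

```python
import signal
import time


def stop_worker(send_signal, is_alive, graceful_timeout, quick_shutdown_timeout,
                sleep=time.sleep, poll_interval=0.1):
    """Escalate SIGTERM -> SIGQUIT -> SIGKILL and return the signals sent."""

    def wait_for_exit(timeout):
        # Poll until the worker is gone or the timeout has been used up.
        waited = 0.0
        while waited < timeout:
            if not is_alive():
                return True
            sleep(poll_interval)
            waited += poll_interval
        return not is_alive()

    sent = []
    steps = (
        (signal.SIGTERM, graceful_timeout),        # graceful termination
        (signal.SIGQUIT, quick_shutdown_timeout),  # quick shutdown, may dump stacks
    )
    for sig, timeout in steps:
        send_signal(sig)
        sent.append(sig)
        if wait_for_exit(timeout):
            return sent

    # The worker ignored both requests: last resort.
    send_signal(signal.SIGKILL)
    sent.append(signal.SIGKILL)
    return sent
```

With this ordering a worker sees SIGQUIT only after it has already overrun graceful_timeout, so a handle_quit that dumps stacks would stay silent on normal shutdowns.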

adk-swisstopo, Apr 15 '25 16:04

It would be nice if we could have a way to do that only when the worker exceeds the graceful_timeout upon termination.

Makes me wonder whether you can already work around the 2-signal escalation sequence with something very close to the suggested 3-signal sequence, given that workers in the general case (exception: a graceful reload that changes the timeout) already know what is coming and when. Register a handler and set an alarm on receipt of SIGTERM, to dump after cfg.graceful_timeout - quick_shutdown_timeout. If the worker has shut down normally by then, nothing happens. If it is still alive, it has approximately quick_shutdown_timeout left to explain why.
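A minimal sketch of that alarm idea, assuming a Unix platform and a handler installed from the main thread. arm_late_dump and disarm_late_dump are hypothetical helpers, not gunicorn API; a worker would call arm_late_dump from its SIGTERM handling with its own timeout values.

```python
import faulthandler
import signal


def _dump_stacks(signum, frame):
    # Writes the stacks of all threads to stderr.
    faulthandler.dump_traceback(all_threads=True)


def arm_late_dump(graceful_timeout, quick_shutdown_timeout, dump=_dump_stacks):
    """On SIGTERM receipt: schedule a stack dump shortly before the deadline."""
    # A delay of 0 would clear the timer, so keep it strictly positive.
    delay = max(graceful_timeout - quick_shutdown_timeout, 0.001)
    signal.signal(signal.SIGALRM, dump)
    signal.setitimer(signal.ITIMER_REAL, delay)
    return delay


def disarm_late_dump():
    """Cancel the pending dump, e.g. once the worker has finished cleanly."""
    signal.setitimer(signal.ITIMER_REAL, 0)
```

If the process exits before the timer fires, the alarm simply dies with it. Whether a SIGALRM handler gets to run promptly inside a busy asynchronous worker depends on the worker type, so treat this as a starting point rather than a drop-in solution.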

pajod, Apr 15 '25 21:04

Yes, we can probably set an alarm in the worker (and I'll probably look into that if there is no interest in updating the arbiter). But having multiple ways to send signals can quickly make it harder to reason about the whole system. Furthermore, if every user needs to reimplement that themselves, they will have to rediscover all the subtleties of proper signal handling the hard way.

adk-swisstopo, Apr 16 '25 06:04

For the gevent case, there is a gevent.spawn_later method that comes in handy: https://github.com/geoadmin/service-stac/pull/563

adk-swisstopo, Apr 16 '25 10:04