Lameducking period for graceful shutdown / serve_with_shutdown

Open vandry opened this issue 5 months ago • 0 comments

Feature Request

Crates

tonic

Motivation

In tonic right now, serve_with_shutdown does not implement lameducking the way I am used to seeing it. I was expecting that on SIGTERM idle keepalive connections would be disconnected and readiness health checks would start returning negative, but that new connections would continue to be accepted (albeit with no keepalive) during a grace period, giving clients time to notice that we are lameducking and select different backends. Only after that delay would we close the listening socket.

What actually happens is that we close the listening socket (or at least we stop calling accept() on it) immediately and then drain requests from existing connections.

Is that okay? I don't know. It depends on whether clients can be counted on to promptly and transparently try a different backend when they get a refused or reset connection (depending on the timing) during the short interval after we have started shutting down but before they have had a chance to update their backend list to exclude us. I feel like most gRPC clients might be all right there, but there are stories of 502s from nginx floating about...

https://github.com/vandry/td-archive-server/commit/7e202e586ed0d3f19e576304ba1bd91ebc760edb

Proposal

First, I would like to solicit opinions about whether the behaviour I am looking for is needed or if the status quo is good enough. We definitely implement the lameducking delay as I describe it at very large scale at $DAYJOB but there might be other considerations here.

If the feature is deemed desirable then I propose:

  • Allow a lameducking delay to be configured, defaulting to zero, which makes serve_with_shutdown have the present behaviour.
  • If nonzero, first drain all existing connections and set max_connection_age to zero on all new ones. Then sleep for the delay while still accepting new connections. Then drop the listener and return.
  • It is the caller's responsibility (e.g. using the tonic-health crate) to flip health checks to bad when the shutdown signal is first sent.
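
To make the proposed sequencing concrete, here is a minimal, std-only sketch of the phases as a pure function. Nothing here is tonic API: `Phase`, `Policy`, `phase`, and `policy` are hypothetical names made up for illustration, and the real change would presumably live inside the accept loop of serve_with_shutdown. The `report_healthy` field only describes what the caller would be expected to report (e.g. via tonic-health), since the proposal leaves that to the caller.

```rust
use std::time::Duration;

/// Hypothetical shutdown phases; none of these names are part of tonic.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    /// No shutdown signal has been received yet.
    Serving,
    /// Signal received and the lameducking delay is still running.
    Lameduck,
    /// Delay elapsed: the listener is dropped and remaining requests drain.
    ListenerClosed,
}

/// What the server (and the caller, for health) should do in a phase.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Policy {
    pub accept_new_connections: bool,
    /// `false` stands in for "max_connection_age set to zero".
    pub keep_connections_alive: bool,
    /// Assumption: the caller flips this, not the server itself.
    pub report_healthy: bool,
}

/// Maps time since the shutdown signal to a phase.
/// `since_signal` is `None` until the signal has been sent.
pub fn phase(since_signal: Option<Duration>, lameduck_delay: Duration) -> Phase {
    match since_signal {
        None => Phase::Serving,
        // A zero delay never matches this arm, so the listener closes
        // immediately, which is the present behaviour.
        Some(elapsed) if elapsed < lameduck_delay => Phase::Lameduck,
        Some(_) => Phase::ListenerClosed,
    }
}

/// Policy for each phase, following the bullet points of the proposal.
pub fn policy(phase: Phase) -> Policy {
    match phase {
        Phase::Serving => Policy {
            accept_new_connections: true,
            keep_connections_alive: true,
            report_healthy: true,
        },
        Phase::Lameduck => Policy {
            accept_new_connections: true,
            keep_connections_alive: false,
            report_healthy: false,
        },
        Phase::ListenerClosed => Policy {
            accept_new_connections: false,
            keep_connections_alive: false,
            report_healthy: false,
        },
    }
}
```

With a delay of zero the `Lameduck` phase is never entered, so the default configuration keeps today's semantics and only servers that opt in see the grace period.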

Alternatives

See above; it is possible that the status quo is fine.

vandry, Sep 14 '24 15:09