quic-rpc
Really allow graceful shutdown
Ok, that's a pretty rubbish title. But what we really need to be able to do is:
- Signal shutdown to the server.
- Server stops accepting new connections.
- Server keeps handling existing connections.
- Once there are no more connections the server finishes, an event you can await.
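To make the four steps concrete, here is a rough sketch using only the standard library. It is not the quic-rpc API: `Conn`, `ServerHandle`, `serve`, `shutdown` and `wait` are hypothetical names, an `mpsc` channel stands in for the transport's listener, and a thread per connection stands in for the real handlers.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for a connection: some work that takes a while.
pub struct Conn {
    pub id: u32,
    pub work: Duration,
}

/// Hypothetical handle returned to whoever started the server.
pub struct ServerHandle {
    shutdown: Arc<AtomicBool>,
    accept_loop: JoinHandle<Vec<u32>>,
}

impl ServerHandle {
    /// Step 1: signal shutdown to the server.
    pub fn shutdown(&self) {
        self.shutdown.store(true, Ordering::SeqCst);
    }

    /// Step 4: block until every connection has finished.
    /// Returns the ids of the connections that ran to completion.
    pub fn wait(self) -> Vec<u32> {
        self.accept_loop.join().expect("accept loop panicked")
    }
}

pub fn serve(incoming: Receiver<Conn>) -> ServerHandle {
    let shutdown = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&shutdown);
    let accept_loop = thread::spawn(move || {
        let mut handlers = Vec::new();
        // Step 2: keep accepting only while shutdown has not been signalled.
        while !flag.load(Ordering::SeqCst) {
            match incoming.recv_timeout(Duration::from_millis(10)) {
                Ok(conn) => handlers.push(thread::spawn(move || {
                    thread::sleep(conn.work); // pretend to serve the connection
                    conn.id
                })),
                Err(RecvTimeoutError::Timeout) => continue,
                Err(RecvTimeoutError::Disconnected) => break,
            }
        }
        // Dropping the receiver makes further connection attempts fail.
        drop(incoming);
        // Step 3: connections accepted earlier run to completion.
        handlers
            .into_iter()
            .map(|h| h.join().expect("handler panicked"))
            .collect()
    });
    ServerHandle { shutdown, accept_loop }
}
```

A caller would keep the `ServerHandle`, call `shutdown()` when asked to terminate, and then `wait()` before exiting. In a real transport the polling loop would presumably be replaced by whatever the transport offers for closing its listener.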
The way the server is used in iroh-memstore would allow making use of this to nicely terminate. And I'm pretty sure the hyper and quinn transports allow us to implement this here.
This is kind of already done for the http2 transport, but since nothing used an explicit waiting API yet, that part was not handled and needs extending.
So this is two parts:
- implement graceful shutdown for the remaining transports
- Expose a consistent API
Question: do we need this for the memory transport? Probably not, right?
I think the memory channel is affected as well. You could have an RPC running while the ServerChannel is being closed. You still want that RPC channel to stay alive until it is done.
Though this does pose a question wrt how to handle an RPC which is infinitely streaming, like the watch. I hadn't considered that yet. I guess you need a grace period or some other way to handle those.
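One possible shape for the grace period idea, purely as a sketch: wait up to a deadline for handlers to finish on their own, then flip an abort flag that streaming handlers are assumed to check between items. `spawn_watch` and `drain_with_grace` are hypothetical names, and threads again stand in for the real RPC handlers.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Hypothetical infinitely streaming handler: sends updates until aborted.
/// Returns how many updates it sent.
pub fn spawn_watch(abort: Arc<AtomicBool>) -> JoinHandle<u64> {
    thread::spawn(move || {
        let mut sent = 0;
        while !abort.load(Ordering::SeqCst) {
            sent += 1; // stand-in for sending one update to the client
            thread::sleep(Duration::from_millis(5));
        }
        sent
    })
}

/// Waits up to `grace` for the handlers to finish by themselves, then sets
/// `abort` and joins the rest. Returns how many handlers had to be aborted.
pub fn drain_with_grace(
    handlers: Vec<JoinHandle<u64>>,
    abort: &AtomicBool,
    grace: Duration,
) -> usize {
    let deadline = Instant::now() + grace;
    // Poll until everything is done or the grace period has run out.
    while Instant::now() < deadline && !handlers.iter().all(|h| h.is_finished()) {
        thread::sleep(Duration::from_millis(5));
    }
    let forced = handlers.iter().filter(|h| !h.is_finished()).count();
    // Tell whatever is still running to stop, then wait for it.
    abort.store(true, Ordering::SeqCst);
    for h in handlers {
        let _ = h.join();
    }
    forced
}
```

Finite requests complete normally inside the grace period, while something like the watch gets cut off once it expires. Whether a cut-off stream should surface as an error or as a clean end of stream to the client is a separate decision this sketch does not make.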
Maybe that is a stupid question, but what is the actual benefit of a graceful shutdown? I expect services to handle all sorts of non-graceful scenarios, such as network outages and power outages, without getting into a weird state. I also want to be able to kill a process with kill -9 without something bad happening.
So why not always shut down non-gracefully, if it makes things simpler?
I guess that is me being influenced by the erlang "let it crash" philosophy that also made it into akka, and that I found quite appealing...
https://medium.com/@vamsimokari/erlang-let-it-crash-philosophy-53486d2a6da
For completeness (and because I'm waiting on CI... :wink: ) I'll add some reasons here from the out-of-band discussion:
- When doing rolling-restarts of a group of servers you want clients to not notice. This means their existing requests should finish without error while they would send new requests to a different server. This can significantly reduce the error rate.
- It allows seeing regular maintenance independently from failures in metrics, also on the client side. It is useful to have your client- and server-side metrics agree on error ratios etc.
- This is in addition to clients having to handle server errors; both contribute to having a reliable system.
- Any ports in use by the server should be ready for re-use to aid fast restarts. After a crash the OS might not release ports right away.
The erlang let-it-crash philosophy is about how to handle errors. It's a good philosophy in the face of decent process supervision. But this doesn't have to mean the only way to shut down a server is to crash it.