tonic: add max connection age grace
Motivation
According to the gRPC ~documentation~ documentation when a connection exceeds the graceful shutdown timeout, the server should begin a secondary forceful shutdown timer. If the client still does not close the connection within this period, the server must terminate the connection.
Problem
Tonic currently does not enforce this second-phase shutdown. As a result, in long-running applications where clients become congested, stuck or simply ignore the GOAWAY frame, connections can linger indefinitely. This leads to unbounded growth in memory usage and a continuously increasing number of live Tokio tasks.
Solution
This PR introduces a max_connection_age_grace timeout that starts after max_connection_age expires. If the client does not close the connection within this grace period, the server forcefully closes it. This mirrors the behavior of the gRPC Go implementation.
A similar idea was previously mentioned in a closed PR but it seems the discussion stalled before the feature was implemented.
@LucioFranco any chance I could get a review from someone? Thank you!
Hi, If I understand correctly https://github.com/grpc/proposal/blob/master/A9-server-side-conn-mgt.md and https://grpc.io/docs/guides/server-graceful-stop/ are not particularly related.
https://github.com/grpc/proposal/blob/master/A9-server-side-conn-mgt.md is about garbage collection of old connections, while graceful shutdown is about server shutdowns. The server will proceed to terminate open connections after a timeout regardless of the underlying connection management after the server shutdown timeout forecefully.
So, while this PR itself looks okay, I am not particularly sure this'll solve issues around server shutdown you may be facing, but it should help reduce unbounded memory growth due to unresponsible clients.
Hey @sauravzg, thank you.
I'm not facing any server shutdown issue, when I say shutdown I imply a connection shutdown and not a server shutdown.
As you already said the main purpose is to reduce the memory growth due to unresponsible clients.
Hey folks, any update on that? Could we merge the PR? cc @LucioFranco