Add drainTimeout property to Http2Protocol/Http2UpgradeHandler

Open kaiburjack opened this issue 3 months ago • 0 comments

This is to delay sending the second GOAWAY (with last seen stream id) after sending the first GOAWAY (with max stream id), in order to mitigate a race between the client already having buffered frames for new streams after the server has sent the first graceful GOAWAY frame and the second GOAWAY.

Once the second GOAWAY frame reaches the client, its own HTTP/2 client implementation may then itself signal a stream reset (rejected stream) error condition to the client application or might send the new forbidden stream frames to the server, where it then would be rejected.

Currently, Tomcat's Http2UpgradeHandler uses the last computed/determined Round-Trip-Time (RTT) and delays only by that amount (which might only be a few microseconds), following the suggestion of RFC 9113 of using "at least one round-trip time". However, due to network conditions and server load and imprecise timers due to OS thread scheduling delays or hypervisor-limited CPU credits (think Google Cloud Platform E2 machine types with scheduler throttling) one RTT might race with the client already having buffered up writes for new streams/requests before it had the chance to see either the first or the last GOAWAY sent by Tomcat.

Therefore, we implement the same behaviour as provided by the Envoy proxy's HTTP Connection Manager drain_timeout: Delay sending the second final GOAWAY after sending the first graceful GOAWAY to give the client enough time to react to the first GOAWAY and stop sending new streams and close the connection on its side first.

Testing this with an Envoy Proxy 1.36.2 client, when using the patched Tomcat server, all 503 errors generated by Envoy due to HTTP/2 stream reset errors completely vanished when doing a Kubernetes rolling update of a deployment. Before, we saw occasional 503 responses due to stream resets when Tomcat would return the final GOAWAY frame too fast.

Bugzilla Enhancement Ticket: https://bz.apache.org/bugzilla/show_bug.cgi?id=69870

Nov 05 '25 22:11 kaiburjack