Intermittent HTTP/2 Connection Hangs and Timeouts Post-Upgrade to v1.29
Description
We have observed a significant increase in 408 status codes with `max_duration_timeout` responses since upgrading from v1.28 to v1.29. Our suspicion is that this issue is related to oghttp2, as disabling it via the reloadable feature flag `http2_use_oghttp2 = false` resolves the problem.
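For anyone cross-referencing, here is a minimal sketch of that kill switch set from the bootstrap. It assumes the full runtime key is `envoy.reloadable_features.http2_use_oghttp2` and that a static runtime layer is acceptable (as opposed to flipping it through the admin runtime endpoint):

```yaml
# Minimal bootstrap sketch: pin the HTTP/2 codec back to nghttp2 by
# disabling the oghttp2 reloadable feature at startup.
layered_runtime:
  layers:
  - name: disable_oghttp2
    static_layer:
      envoy.reloadable_features.http2_use_oghttp2: false
```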
The observed behavior is that client requests hang for the entire duration of the `stream_timeout` configured on the individual routes, with no activity detected on the connection. Ultimately, this results in a 408 status code with `max_duration_timeout`.
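For context, the per-route timeout in play looks roughly like the sketch below. The cluster name and duration are illustrative, and it assumes the route-level `max_stream_duration` setting is what `stream_timeout` refers to above, since that is the knob that, in this report, ends a stream with a 408 and the `max_duration_timeout` detail:

```yaml
# Illustrative route entry: cap the total stream duration so a wedged
# request is terminated instead of hanging indefinitely.
routes:
- match:
    prefix: "/"
  route:
    cluster: example_upstream        # hypothetical cluster name
    max_stream_duration:
      max_stream_duration: 300s      # illustrative value
```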
We switched from v1.28 to v1.29 on 01/23 and turned off oghttp2 on 02/10.
Repro Steps
Though the issue is intermittent and lacks a consistent reproduction scenario, our analysis of the access logs provides some insights (the access-log fields behind these numbers are sketched after the list):
- Downstream connections use HTTP/2.
- Upstream connections are a mixture of HTTP/2 and HTTP/1.1.
- All the requests that become stuck are POSTs (`:method` POST).
- Access logs indicate that no body is received from the client; the number of bytes received is reported as 0.
- The number of bytes sent is always 27, i.e. the length of the `max_duration_timeout` response.
- The upstream request attempt count is reported as 1.
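For completeness, a sketch of an access log format that surfaces the fields behind those observations; the logger and the exact format string here are illustrative, not the exact production config:

```yaml
# Illustrative access log config exposing the fields referenced above:
# method, response code detail, bytes in/out, and upstream attempt count.
access_log:
- name: envoy.access_loggers.stdout
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
    log_format:
      text_format_source:
        inline_string: >
          %REQ(:METHOD)% %REQ(:PATH)% %RESPONSE_CODE% %RESPONSE_CODE_DETAILS%
          rx=%BYTES_RECEIVED% tx=%BYTES_SENT%
          attempts=%UPSTREAM_REQUEST_ATTEMPT_COUNT%
```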
cc @alyssawilk in case you have any ideas on what might trigger this behavior.
cc @birenroy
I'd like to confirm my understanding of the problem. Clients speaking HTTP/2 send POST requests to Envoy. Since the upstream_request_attempt_count is nonzero, it seems like the Envoy is forwarding request headers to the upstream. However, the Envoy doesn't read or forward any message body data from the client.
Is that right?
- Is there any other commonality among the affected requests?
- Are these the first requests on a connection?
- Would it be possible to get example request headers and the post body length for one of the failures? Redacting any sensitive field names or values would be fine.
Could you also provide the initial stream and connection flow control window sizes that your Envoy is using?
The symptoms sound like a possible flow control issue, but the flow control window update algorithm should be the same across the two codecs.
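For reference, those windows are configured per codec; a minimal sketch of where they live, assuming the downstream HttpConnectionManager's `http2_protocol_options` (the values shown are, to my understanding, Envoy's documented defaults):

```yaml
# Fragment of the downstream HttpConnectionManager typed_config; these
# two fields set the initial stream- and connection-level flow-control
# windows that the HTTP/2 codec advertises.
http2_protocol_options:
  initial_stream_window_size: 268435456      # 256 MiB
  initial_connection_window_size: 268435456  # 256 MiB
```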
We identified a potential flow control leak in oghttp2. Please stand by for a fix.
We saw connection hang in #31710 as well. Maybe the fix was not backported to 1.29?
The fix for #31710 was included in v1.29. This looks like a separate issue. I will see if I can update the QUICHE dependency to the fixed version.
@birenroy Thanks for looking into it. I can get you the HAR file and provide more of the information you asked for earlier.
We identified and fixed a slow flow control leak, which could be the cause of the hangs you experienced. If you're able to pull the latest revision of v1.29, please give that a try.
Otherwise, I will see if it's possible to create another point release once we finish debugging https://github.com/envoyproxy/envoy/issues/32371.
Can you help give us an idea of what type of request would trigger this issue? Only a small subset of our upstream clusters and paths were affected, and we are trying to figure out why those paths specifically were affected and why the issue was not more widespread. Thanks!
cc @birenroy
> Can you help give us an idea of what type of request would trigger this issue?
The fix in https://github.com/google/quiche/commit/9e875938052443c7f51b25012093be59e20598ac relates to the handling of HTTP/2 `DATA` frames with padding. We weren't properly accounting for the padding in flow control, so over time the connection flow control window would get out of sync and the connection would deadlock.

That bug was only triggered by `DATA` frame padding, and I imagine not all HTTP/2 clients and servers generate padding.
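To make the accounting concrete (illustrative numbers, and one plausible way the mismatch plays out rather than the exact mechanism in the patch): per RFC 7540, the entire `DATA` frame payload counts against flow control, including the Pad Length byte and the padding itself. A frame carrying 1,000 bytes of data plus 10 bytes of padding therefore consumes 1,011 bytes of window; if only the 1,000 data bytes are ever credited back via `WINDOW_UPDATE`, each padded frame leaks 11 bytes until the connection window reaches zero and the connection stalls.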
I think we can mark this closed until we have evidence of connection hangs unrelated to padding or flow control.