
Intermittent HTTP/2 Connection Hangs and Timeouts Post-Upgrade to v1.29

Open agrawroh opened this issue 1 year ago • 9 comments

Description

We have observed a significant increase in 408 status codes accompanied by max_duration_timeout responses after upgrading from version 1.28 to 1.29. Our suspicion is that the issue is related to oghttp2, since disabling it via the reloadable flag http2_use_oghttp2 = false resolves the problem.
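
For anyone applying the same mitigation, here is a minimal sketch of setting that reloadable flag through a static runtime layer in the bootstrap; the layer name is arbitrary, and the same key can also be flipped at runtime through an admin or RTDS layer.

```yaml
layered_runtime:
  layers:
  - name: static_layer_0                                # arbitrary layer name
    static_layer:
      # Fall back to the nghttp2 codec instead of oghttp2
      envoy.reloadable_features.http2_use_oghttp2: false
```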

The observed behavior is that client requests hang for the entire stream_timeout configured on the individual routes, with no activity detected on the connection. Ultimately, this results in a 408 status code with max_duration_timeout.
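
For context, a minimal sketch of the kind of per-route configuration involved, assuming the route-level stream timeout mentioned above is Envoy's max_stream_duration (the route, vhost, and cluster names are hypothetical); when this timer fires before a response has started, the request is failed locally with the 408 / max_duration_timeout seen here.

```yaml
route_config:
  name: local_route                   # hypothetical
  virtual_hosts:
  - name: example_vhost               # hypothetical
    domains: ["*"]
    routes:
    - match: { prefix: "/" }
      route:
        cluster: example_cluster      # hypothetical
        max_stream_duration:
          # Hard cap on the stream lifetime for this route
          max_stream_duration: 30s
```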

[graph: rate of 408 / max_duration_timeout responses over time]

We switched to v1.29 from v1.28 on 01/23 and turned off oghttp2 on 02/10.

Repro Steps

Though the issue is intermittent and lacks a consistent reproduction scenario, our analysis of the access logs provides some insights (see the access-log sketch after this list):

  • Downstream connections use HTTP/2.
  • Upstream connections comprise a mixture of HTTP/2 and HTTP/1.1.
  • All the requests that become stuck have :method POST.
  • Access logs indicate that no body is received from the client, with the number of bytes received reported as 0.
  • The number of bytes sent is always 27, i.e. the length of max_duration_timeout.
  • The upstream request attempt count is reported as 1.
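
For anyone trying to spot the same pattern in their own logs, here is a minimal file access log sketch (placed inside the HttpConnectionManager config) that captures the fields listed above with Envoy's standard command operators; the output path and field layout are illustrative.

```yaml
access_log:
- name: envoy.access_loggers.file
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
    path: /dev/stdout
    log_format:
      text_format_source:
        inline_string: >
          %REQ(:METHOD)% %REQ(:PATH)% %PROTOCOL% %RESPONSE_CODE%
          %RESPONSE_CODE_DETAILS% rx_bytes=%BYTES_RECEIVED%
          tx_bytes=%BYTES_SENT% attempts=%UPSTREAM_REQUEST_ATTEMPT_COUNT%
          duration_ms=%DURATION%
```

With this layout, the stuck requests described above would show up as lines with a 408 response code, max_duration_timeout response code details, rx_bytes=0, and attempts=1.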

agrawroh avatar Feb 14 '24 08:02 agrawroh

cc @alyssawilk in case you have any ideas on what might trigger this behavior.

agrawroh avatar Feb 14 '24 08:02 agrawroh

cc @birenroy

alyssawilk avatar Feb 14 '24 14:02 alyssawilk

I'd like to confirm my understanding of the problem. Clients speaking HTTP/2 send POST requests to Envoy. Since the upstream_request_attempt_count is nonzero, it seems like the Envoy is forwarding request headers to the upstream. However, the Envoy doesn't read or forward any message body data from the client.

Is that right?

  • Is there any other commonality among the affected requests?
  • Are these the first requests on a connection?
  • Would it be possible to get example request headers and the post body length for one of the failures? Redacting any sensitive field names or values would be fine.

birenroy avatar Feb 15 '24 16:02 birenroy

Could you also provide the initial stream and connection flow control window sizes that your Envoy is using?

The symptoms sound like a possible flow control issue, but the flow control window update algorithm should be the same across the two codecs.
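
For reference, a minimal sketch of where those windows are set on the downstream (listener) side, via the connection manager's HTTP/2 options; the values shown are illustrative, and the upstream side accepts the same fields through the cluster's HTTP protocol options.

```yaml
# Inside the HttpConnectionManager config
http2_protocol_options:
  initial_stream_window_size: 268435456       # 256 MiB, illustrative value
  initial_connection_window_size: 268435456   # 256 MiB, illustrative value
```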

birenroy avatar Feb 15 '24 17:02 birenroy

We identified a potential flow control leak in oghttp2. Please stand by for a fix.

birenroy avatar Feb 15 '24 20:02 birenroy

We saw connection hangs in https://github.com/envoyproxy/envoy/issues/31710 as well. Maybe the fix was not backported to 1.29?

milton0825 avatar Feb 15 '24 21:02 milton0825

We saw connection hangs in #31710 as well. Maybe the fix was not backported to 1.29?

The fix for #31710 was included in v1.29. This looks like a separate issue. I will see if I can update the QUICHE dependency to the fixed version.

birenroy avatar Feb 15 '24 23:02 birenroy

@birenroy Thanks for looking into it. I can get you the HAR file and provide more information on what you asked before.

agrawroh avatar Feb 16 '24 05:02 agrawroh

We identified and fixed a slow flow control leak, which could be the cause of the hangs you experienced. If you're able to pull the latest revision of v1.29, please give that a try.

Otherwise, I will see if it's possible to create another point release once we finish debugging https://github.com/envoyproxy/envoy/issues/32371.

birenroy avatar Feb 16 '24 23:02 birenroy

Can you help give us an idea of what type of request would trigger this issue? Only a small subset of our upstream clusters & paths were affected, and we are trying to figure out why those paths specifically were affected and why the issue was not more widespread. Thanks!

cc @birenroy

corinneherz-db avatar Mar 04 '24 19:03 corinneherz-db

Can you help give us an idea of what type of request would trigger this issue?

The fix in https://github.com/google/quiche/commit/9e875938052443c7f51b25012093be59e20598ac relates to the handling of HTTP/2 DATA frames with padding. We weren't properly accounting for the padding in flow control, so over time the connection flow control window would get out of sync and the connection would deadlock.

That bug was only triggered by DATA frame padding, and I imagine not all HTTP/2 clients and servers generate padding.

birenroy avatar Mar 04 '24 22:03 birenroy

I think we can mark this closed until we have evidence of connection hangs unrelated to padding or flow control.

birenroy avatar Apr 01 '24 16:04 birenroy