
RST_STREAM sent to downstream client instead of remaining DATA, HEADERS on end stream

Open niloc132 opened this issue 2 years ago • 21 comments

Title: RST_STREAM sent to downstream client instead of remaining DATA, HEADERS on end stream

Description: In some circumstances, it is possible for Envoy to skip sending the remaining DATA and HEADERS frames and instead send RST_STREAM, resulting in errors in the downstream client. In this case, we have a gRPC service returning data to a client, and some client implementations get an error instead of seeing the expected grpc-status: 0 trailers. The stream needs to be bidirectional, with the client not yet having half-closed its side (which in some cases offers a potential workaround).

Downstream error reported by netty-grpc:

io.grpc.StatusRuntimeException: UNAVAILABLE: RST_STREAM closed stream. HTTP/2 error code: NO_ERROR
	at io.grpc.Status.asRuntimeException(Status.java:539)
	at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487)
	at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
	at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
	at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)

Downstream error reported by okhttp-grpc:

io.grpc.StatusRuntimeException: INTERNAL: No error: A GRPC status of OK should have been sent
Rst Stream
	at io.grpc.Status.asRuntimeException(Status.java:537)
	at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:481)
	at io.grpc.internal.DelayedClientCall$DelayedListener$3.run(DelayedClientCall.java:489)
	at io.grpc.internal.DelayedClientCall$DelayedListener.delayOrExecute(DelayedClientCall.java:453)
	at io.grpc.internal.DelayedClientCall$DelayedListener.onClose(DelayedClientCall.java:486)
	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:567)

No error is reported when using the Python gRPC client (which we understand to be a cython wrapper around the C++ client). We speculate that the C++ client might be handling data faster than the Java clients, and that the bug might be rooted in Envoy not being prepared to forward all remaining data for some reason.

Note that I have not validated this issue in non-gRPC contexts, but there is nothing gRPC-specific about the envoy.yaml, so it seems likely that this also impacts other h2 upstream+downstream bidirectional streaming use cases.

Repro steps:

  • Create a simple gRPC server with a bidirectional service. Implement the server so that it sends around 4MB to the client when connected, then immediately half-closes its side of the stream (see the sketch after this list). The example project below demonstrates the issue with both grpc-java and Python server implementations. Note that the error can still happen with smaller payloads, but it seems less likely.
  • Implement a client that connects and waits for output without half-closing, logging the bytes it receives and the eventual success/failure status. Confirm that connecting directly to the server behaves as expected, with all data and the status received successfully.
  • Place a simple Envoy instance between the client and server, offering h2 to downstream clients and connecting to upstream clusters only over h2. It may be possible to simplify further; the included example is also intended to disable a variety of timeouts.
  • Connect the client to the server through Envoy. Expected behavior is that results are the same as without Envoy; actual observed behavior is that the client usually sees an error, despite Envoy access logs appearing to show success.
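As a rough illustration of the server described in the first step, here is a minimal grpc-java sketch. The service and method names (org.example.TestService/Hello) come from the access log below; the generated TestServiceGrpc stub and a HelloMessage type with a bytes payload field are assumptions for illustration, not the actual repro code (see the linked repository for that).

```java
import com.google.protobuf.ByteString;
import io.grpc.Server;
import io.grpc.ServerBuilder;
import io.grpc.stub.StreamObserver;

public class ReproServer extends TestServiceGrpc.TestServiceImplBase {
  @Override
  public StreamObserver<HelloMessage> hello(StreamObserver<HelloMessage> responseObserver) {
    // As soon as the client connects, send roughly 4MB and half-close the server
    // side; the client keeps its half of the stream open.
    responseObserver.onNext(HelloMessage.newBuilder()
        .setPayload(ByteString.copyFrom(new byte[4_000_000]))
        .build());
    // Server half-close: DATA frames, then trailing HEADERS (grpc-status: 0, END_STREAM).
    responseObserver.onCompleted();
    // Ignore anything the client might still send on its open half of the stream.
    return new StreamObserver<HelloMessage>() {
      @Override public void onNext(HelloMessage value) {}
      @Override public void onError(Throwable t) {}
      @Override public void onCompleted() {}
    };
  }

  public static void main(String[] args) throws Exception {
    Server server = ServerBuilder.forPort(8080).addService(new ReproServer()).build().start();
    server.awaitTermination();
  }
}
```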

To try this out, we've built a sample git repository to demonstrate the issue at https://github.com/niloc132/envoy-rst-stream. Instructions are there to run the (default) Java server and Java client, but either can be replaced with a Python implementation.

Admin and Stats Output:

Include the admin output for the following endpoints: /stats, /clusters, /routes, /server_info. For more information, refer to the admin endpoint documentation.

http://localhost:9090/stats - https://gist.github.com/niloc132/d6b5c778ab2b79b8a8c4781dc80ac40b

http://localhost:9090/clusters

server::observability_name::server
server::default_priority::max_connections::1024
server::default_priority::max_pending_requests::1024
server::default_priority::max_requests::1024
server::default_priority::max_retries::3
server::high_priority::max_connections::1024
server::high_priority::max_pending_requests::1024
server::high_priority::max_requests::1024
server::high_priority::max_retries::3
server::added_via_api::false
server::192.168.48.2:8080::cx_active::1
server::192.168.48.2:8080::cx_connect_fail::0
server::192.168.48.2:8080::cx_total::1
server::192.168.48.2:8080::rq_active::0
server::192.168.48.2:8080::rq_error::0
server::192.168.48.2:8080::rq_success::1
server::192.168.48.2:8080::rq_timeout::0
server::192.168.48.2:8080::rq_total::1
server::192.168.48.2:8080::hostname::java-server
server::192.168.48.2:8080::health_flags::healthy
server::192.168.48.2:8080::weight::1
server::192.168.48.2:8080::region::
server::192.168.48.2:8080::zone::
server::192.168.48.2:8080::sub_zone::
server::192.168.48.2:8080::canary::false
server::192.168.48.2:8080::priority::0
server::192.168.48.2:8080::success_rate::-1
server::192.168.48.2:8080::local_origin_success_rate::-1

http://localhost:9090/routes - This doesn't appear to be a valid admin URL.

http://localhost:9090/server_info - https://gist.github.com/niloc132/6516b8fcae91a7f2c4ae23024aeb5b2c

Config: See https://github.com/niloc132/envoy-rst-stream/blob/main/envoy.yaml along with the rest of the repo to reproduce this issue

Logs: Access log entry for the request, showing 4MB of protobuf payload sent "successfully", despite the client showing an error:

envoy-rst-stream-envoy-1        | [2023-10-12T16:17:40.943Z] "POST /org.example.TestService/Hello HTTP/2" 200 - 0 4000010 190 - "-" "grpc-java-netty/1.58.0" "d02b9201-cb74-4cc6-8b80-012146aafe04" "localhost:8000" "192.168.48.2:8080"

This log has level=trace, upstream=off, dns=off, config=off, and includes the access logs with the above line: https://gist.github.com/niloc132/521a45c0eb38c7733593c56e8c06fc02. Here's an excerpt which appears to show Envoy saying it received the trailers from upstream and sent them to the client:

envoy-rst-stream-envoy-1        | [2023-10-12 16:17:41.133][45][trace][router] [source/common/router/upstream_request.cc:315] [Tags: "ConnectionId":"4","StreamId":"9345982487776441659"] upstream response trailers:
envoy-rst-stream-envoy-1        | 'grpc-status', '0'
envoy-rst-stream-envoy-1        | 
envoy-rst-stream-envoy-1        | [2023-10-12 16:17:41.133][45][debug][router] [source/common/router/upstream_request.cc:469] [Tags: "ConnectionId":"4","StreamId":"9345982487776441659"] resetting pool request
envoy-rst-stream-envoy-1        | [2023-10-12 16:17:41.133][45][debug][client] [source/common/http/codec_client.cc:158] [Tags: "ConnectionId":"5"] request reset
envoy-rst-stream-envoy-1        | [2023-10-12 16:17:41.133][45][trace][main] [source/common/event/dispatcher_impl.cc:251] item added to deferred deletion list (size=1)
envoy-rst-stream-envoy-1        | [2023-10-12 16:17:41.133][45][debug][pool] [source/common/conn_pool/conn_pool_base.cc:215] [Tags: "ConnectionId":"5"] destroying stream: 0 remaining
envoy-rst-stream-envoy-1        | [2023-10-12 16:17:41.133][45][trace][main] [source/common/event/dispatcher_impl.cc:251] item added to deferred deletion list (size=2)
envoy-rst-stream-envoy-1        | [2023-10-12 16:17:41.133][45][trace][main] [source/common/event/dispatcher_impl.cc:251] item added to deferred deletion list (size=3)
envoy-rst-stream-envoy-1        | [2023-10-12 16:17:41.133][45][debug][http] [source/common/http/conn_manager_impl.cc:1759] [Tags: "ConnectionId":"4","StreamId":"9345982487776441659"] encoding trailers via codec:
envoy-rst-stream-envoy-1        | 'grpc-status', '0'
envoy-rst-stream-envoy-1        | 
envoy-rst-stream-envoy-1        | [2023-10-12 16:17:41.133][45][debug][http] [source/common/http/conn_manager_impl.cc:221] [Tags: "ConnectionId":"4","StreamId":"9345982487776441659"] doEndStream() resetting stream
envoy-rst-stream-envoy-1        | [2023-10-12 16:17:41.133][45][debug][http] [source/common/http/conn_manager_impl.cc:1798] [Tags: "ConnectionId":"4","StreamId":"9345982487776441659"] stream reset: reset reason: local reset, response details: -

Wireshark screencap showing upstream (192.168.48.2:8080) sending envoy (192.168.48.3) HEADERS with end of stream set to true, then envoy sending RST_STREAM to both downstream client (192.168.48.1) and upstream. These messages are highlighted - other DATA messages from upstream->envoy and envoy->client can also be seen.

[screenshot857]

Conversely, here's a successful Python client (irritatingly, both streams had id 1, so be sure to check source/destination IP): [screenshot858]

And here's a Java client also succeeding through envoy: [screenshot859]

I can provide the pcapng file for any of these if desired, but the example project should easily reproduce this as well.

Call Stack: N/A

Other notes: Through Wireshark and by setting additional custom trailers on server responses, we've ruled out the scenario where the Python client also gets the RST_STREAM (instead of the trailers) but incorrectly reports success. We also tested with both the Netty and OkHttp clients; both exhibit the error, giving a degree of confidence that this is an Envoy issue rather than both implementations having the same bug.

niloc132 avatar Oct 12 '23 16:10 niloc132

The 1.27.1 release with a fix for resetting streams was just brought to my attention - I tweaked the linked project to use the v1.27.1 envoy image, and can confirm that this did not fix the issue.

niloc132 avatar Oct 12 '23 17:10 niloc132

CC @yanavlasov @alyssawilk

kyessenov avatar Oct 17 '23 17:10 kyessenov

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Nov 16 '23 20:11 github-actions[bot]

Adding comment to trigger activity. Hoping project member can take a look. Thanks!

devinrsmith avatar Nov 16 '23 21:11 devinrsmith

Sorry we haven't been able to get to this yet. It could be a codec issue or an Envoy issue and we're just now changing the underlying codec in https://github.com/envoyproxy/envoy/pull/31086. Can you see if that addresses the problem and if not I'll try to take a look?

alyssawilk avatar Nov 29 '23 14:11 alyssawilk

Is there an easy way to get a docker image for an outstanding PR? I see https://github.com/envoyproxy/envoy/tree/main/ci#ubuntu-envoy-image, but it looks like envoyproxy/envoy-build-ubuntu hasn't been updated recently.

devinrsmith avatar Nov 29 '23 16:11 devinrsmith

Maybe just keep an eye on that one - will hopefully merge this week and it'll be easier to test

alyssawilk avatar Nov 30 '23 17:11 alyssawilk

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Dec 30 '23 20:12 github-actions[bot]

Can someone confirm that the linked PR (merged as https://github.com/envoyproxy/envoy/commit/b58d31272913fa8bb1bd1b69b0102d710c97d431) has no tagged release? It does appear that the issue happens less frequently with the current dev-tagged docker image, 717264bcc05a, but it does still happen.

I made no other changes to the test project linked aside from updating the envoyproxy/envoy tag.

niloc132 avatar Jan 02 '24 22:01 niloc132

Also worth noting, while in general Envoy will process frames in the order they were sent (headers, data, reset), if you have any filters which delay processing (pause to buffer or for an RPC) I think it's possible for the reset to end up being processed before headers and data. Could that be the issue you're seeing?

alyssawilk avatar Jan 08 '24 16:01 alyssawilk

Can you elaborate on "any filters" - do you mean on the upstream cluster ("instance")? The filters in the sample project which reproduces the issue are pretty simple (though arguably they could be simpler still - websockets isn't needed/used), and I can't imagine that removing the access log or something like that is what you're referring to?

https://github.com/niloc132/envoy-rst-stream/blob/117744ac3ba722560d05d62a533a9edc1e6c7af5/envoy.yaml#L11-L41

It has been a few weeks, but unless I misread my own notes above, the first wireshark screenshot shows the upstream sending DATA, DATA, ... HEADERS(close), and no RST_STREAM, with envoy dropping the HEADERS and instead sending RST_STREAM to both upstream and client. That is, this shouldn't be a question of envoy processing the RST_STREAM ahead of HEADERS.

With that said, my read of the gRPC sequence diagrams is that an endpoint can send HEADERS(close), RST_STREAM - e.g. the server might say "here's the trailers, and I'm not accepting more messages" - but it would not be correct for the trailers to be discarded. To be clear: that is not what is happening here, but it was an early theory we were considering. This case may not exist in practice (and if not, I'm sorry for muddying the waters), but it seems to be expected that servers (gRPC, for example) can behave this way. https://httpwg.org/specs/rfc7540.html#StreamStates

niloc132 avatar Jan 08 '24 17:01 niloc132

Sorry, I may not be being clear on why this is happening, but I do agree there appears to be at least one legit bug where stopping sending can result in a response not being sent downstream. Can't make any promises, but I'll see if we can find someone to tackle it. cc @krajshiva

alyssawilk avatar Jan 08 '24 18:01 alyssawilk

Is there anything else in the log preceding the receipt of the trailers from upstream? Is it possible the request is timing out?

yanavlasov avatar Jan 08 '24 19:01 yanavlasov

@yanavlasov Thanks for taking a look - the entire test fails in a fraction of a second, so I highly doubt there is a default timeout that low. The example linked should fail effectively instantly for each client request that is made.

The grpc service is bidi streaming, but the server's implementation is "send a single 4MB payload and then half-close", so it is functionally a unary call - this is for the sake of simplicity rather than being required to reproduce the bug. Once all 4MB has been sent, the bug happens right away, with some dozens of DATA frames not passed to the client (and no HEADERS either, etc). The example implementation sends int32 values in the bytes payload, starting from around -4000000 and stopping at zero, so you can see from the last DATA frame that the client did not receive all of the expected data (i.e. the last bytes weren't 0x00000000); a sketch of this counting scheme follows.
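For reference, here is a hedged sketch (plain Java, not taken from the repro project) of that counting-payload idea: the payload is a sequence of int32 values ending at 0, so truncation is visible because the final four bytes received are not 0x00000000. The class and method names here are made up for illustration.

```java
import java.nio.ByteBuffer;

public class CountingPayload {
  // Build a payload of consecutive int32 values ending at 0, e.g. ..., -2, -1, 0.
  public static byte[] build(int totalBytes) {
    int count = totalBytes / 4;
    ByteBuffer buf = ByteBuffer.allocate(count * 4);
    for (int i = -(count - 1); i <= 0; i++) {
      buf.putInt(i);
    }
    return buf.array();
  }

  // Client-side check: if the stream was truncated, the last int32 will not be 0.
  public static boolean endsAtZero(byte[] received) {
    if (received.length < 4) {
      return false;
    }
    return ByteBuffer.wrap(received, received.length - 4, 4).getInt() == 0;
  }
}
```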

niloc132 avatar Jan 08 '24 20:01 niloc132

I'm not sure when I'll have the cycles to sort this one out but I do think it's worth tackling so self-assigning mainly so I don't lose track of it

alyssawilk avatar Jan 18 '24 20:01 alyssawilk

Hi @alyssawilk,

We experienced a similar issue with the grpc-js implementation using Istio. Here you can see my comments and tests: https://github.com/grpc/grpc-node/issues/2569#issuecomment-1883015207

Do you think you could take a look?

Thanks.

ianebot avatar Apr 01 '24 11:04 ianebot

We are currently looking into this problem and will update this issue once we have more info.

yanavlasov avatar May 01 '24 14:05 yanavlasov

/sub

jmarantz avatar May 02 '24 21:05 jmarantz

This may take some time to resolve, as adding support for half-closed server streams is a significant change. I do not have an ETA at this point.

yanavlasov avatar May 06 '24 17:05 yanavlasov

Hey @niloc132, do you still have the pcap from the screenshot? I would like to analyze it if you can share it with me. I have the same (or at least a very similar) issue, and I thought I had identified the root cause, but I need to confirm one more important point, and the pcap could be very helpful to prove it.

sio4 avatar May 22 '24 03:05 sio4

Sorry for my delay @sio4 - please email me at [email protected] and I will forward them to you, otherwise you can run the sample project above to reproduce.

niloc132 avatar May 29 '24 14:05 niloc132

Thanks @niloc132 for providing the pcap.

Indeed, it seems your issue is similar to one of mine but not exactly the same. I did an investigation with trace logs from the grpc-java client and Envoy's trace log, and the pattern looks the same. A pcap makes the issue very clear from a networking perspective, but trace logs from Envoy or the gRPC library can also help a lot, especially for high-level inspection.

The brief symptom is as follows:

The smooth scenario should be (see the state-transition sketch after this list):

  1. The upstream server sends a HEADERS frame with END_STREAM at the end of the response, so it enters the half-closed (local) state.
  2. When Envoy receives the HEADERS frame with END_STREAM from the upstream, it
    1. becomes half-closed (remote) toward the upstream,
    2. sends out the last HEADERS with END_STREAM to the downstream if there are no deferred frames,
    3. becomes half-closed (local) toward the downstream,
    4. then sends RST_STREAM to both upstream and downstream.
  3. When the upstream server receives the RST_STREAM while its state is half-closed (local), it closes the stream as normal.
  4. When the downstream client receives the RST_STREAM after it became half-closed (remote) by receiving all DATA and HEADERS, it may generate an error depending on the error code.
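The half-closed states above are the HTTP/2 stream states from RFC 9113. Purely as an illustration (this is not Envoy or gRPC code), a minimal Java state machine for the transitions the scenario walks through might look like this:

```java
// Simplified: ignores the idle/reserved states and treats RST_STREAM as an
// immediate transition to closed in every state, per RFC 9113 section 5.1.
public class H2StreamState {
  public enum State { OPEN, HALF_CLOSED_LOCAL, HALF_CLOSED_REMOTE, CLOSED }

  private State state = State.OPEN;

  // We sent a frame with END_STREAM set: we may no longer send, only receive.
  public State onEndStreamSent() {
    state = (state == State.HALF_CLOSED_REMOTE) ? State.CLOSED : State.HALF_CLOSED_LOCAL;
    return state;
  }

  // The peer sent END_STREAM: the peer is done sending.
  public State onEndStreamReceived() {
    state = (state == State.HALF_CLOSED_LOCAL) ? State.CLOSED : State.HALF_CLOSED_REMOTE;
    return state;
  }

  // RST_STREAM in either direction terminates the stream.
  public State onRstStream() {
    state = State.CLOSED;
    return state;
  }
}
```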

In your pcap, I can see the downstream data streaming paused after packet number 166 (stream 3 is the downstream and stream 1 is the upstream here). So at that point, I guess Envoy was not able to send data to the downstream anymore for some reason. My guess (based on my own issue) is that it could be the result of the onAboveWriteBufferHighWatermark event, which means the downstream write buffer is full. After that moment, the pcap shows that receiving from the upstream continues until the HEADERS with END_STREAM, and finally Envoy sends out the RST_STREAM.
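To make the watermark idea concrete, here is a purely illustrative Java sketch of that buffering pattern - not Envoy's implementation. The thresholds and all names are invented; onAboveWriteBufferHighWatermark is the event named above, and the low-watermark counterpart is my assumption about how such a scheme typically resumes reading.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class WatermarkBuffer {
  private static final int HIGH_WATERMARK = 1024 * 1024; // assumed 1MB, not Envoy's default
  private static final int LOW_WATERMARK = 256 * 1024;   // assumed 256KB

  private final Queue<byte[]> pending = new ArrayDeque<>();
  private int bufferedBytes = 0;
  private boolean upstreamPaused = false;

  // Called when a DATA frame arrives from upstream but cannot be written downstream yet.
  public void enqueue(byte[] frame) {
    pending.add(frame);
    bufferedBytes += frame.length;
    if (!upstreamPaused && bufferedBytes > HIGH_WATERMARK) {
      upstreamPaused = true;  // analogue of the onAboveWriteBufferHighWatermark event
    }
  }

  // Called when the downstream socket drains and the next frame can be written out.
  public byte[] dequeue() {
    byte[] frame = pending.poll();
    if (frame != null) {
      bufferedBytes -= frame.length;
      if (upstreamPaused && bufferedBytes < LOW_WATERMARK) {
        upstreamPaused = false; // resume reading from the upstream
      }
    }
    return frame;
  }

  // While true, the proxy stops reading from the upstream connection.
  public boolean isUpstreamPaused() {
    return upstreamPaused;
  }
}
```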


This is slightly different from my issue. I thought my issue happened when Envoy got a RST_STREAM while it still had deferred frames, and Envoy interpreted the RST_STREAM as an abort signal (which is generally true), so it just aborted all the actions it should have taken and discarded all the deferred frames. However, in your case it did not receive a RST_STREAM at that moment, so I guess Envoy just discarded the deferred frames when it sent the RST_STREAM.

From my case, there are two points to be addressed (a sketch of the second point follows this list):

  1. Envoy should support half-closing: it should not send RST_STREAM after becoming half-closed (local); it should wait until receiving END_STREAM from the peer.
    • [ ] #34461 is for this point.
  2. Envoy should interpret RST_STREAM with error code NO_ERROR as OK, since it does not mean there is any error in the response - the peer just wants to close the stream. Also, it should not discard the remaining deferred frames, since they are still proper data (NO_ERROR).
    1. https://datatracker.ietf.org/doc/html/rfc9113#name-error-codes
    2. https://datatracker.ietf.org/doc/html/rfc9113#name-http-message-framing
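Roughly, the behavior argued for in point 2 could look like the following Java sketch (illustrative only, not Envoy's code; the frame and writer types are invented): deferred frames are flushed downstream when the reset carries NO_ERROR, and dropped only on a real error code.

```java
import java.util.List;

public final class DeferredFrameHandling {
  public enum H2ErrorCode { NO_ERROR, PROTOCOL_ERROR, INTERNAL_ERROR, CANCEL }

  public interface Frame {}

  public interface DownstreamWriter {
    void write(Frame frame);
  }

  // Decide what to do with frames that were queued for the downstream but not yet
  // written when RST_STREAM arrives from the upstream.
  public static void onUpstreamRstStream(H2ErrorCode code, List<Frame> deferred, DownstreamWriter out) {
    if (code == H2ErrorCode.NO_ERROR) {
      // RFC 9113: NO_ERROR means "not a result of an error", so the queued DATA and
      // trailing HEADERS are still a valid response - deliver them before closing.
      for (Frame f : deferred) {
        out.write(f);
      }
    }
    // For real error codes the response is already broken; dropping the queue and
    // propagating the reset downstream is the reasonable behavior.
    deferred.clear();
  }
}
```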

Another buggy behavior visible in your pcap is that Envoy sent RST_STREAM with error code NO_ERROR to the downstream while it had just discarded all the deferred frames. It should not be NO_ERROR, since that means "The associated condition is not a result of an error", but the fact is that the response is incorrect here.


So I guess Envoy still thinks it did everything it should do correctly at this moment. IMHO, the critical issue here is this point.

<...> the python client getting the RST_STREAM as well but incorrectly reporting success. <...>

For this, I think the Python implementation is half correct. Since it gets 200 OK at the beginning, and the RST_STREAM's error code is NO_ERROR, it is a successful response from HTTP/2's perspective. However, since this is gRPC, I think it should also interpret the gRPC status in the trailers, which is not available here because Envoy did not proxy the HEADERS.

I hope we can investigate further, especially those uncertain parts. I suggest you try collecting trace logs on all components (gRPC server and client, and Envoy) so we can see the events and decisions in more detail.

(By the way, the pattern in the screenshot in your opening post and the pcap you shared with me look somewhat different.)

sio4 avatar Jun 19 '24 08:06 sio4

When might this get tagged for a release? I just hit an issue in production that looks exactly the same, after raising our Python gRPC client's max receive message size above the default 4MB, going through our Envoy load balancer.

Update: We have been running the latest envoy-distroless-dev image and still encountered this problem. So it doesn't seem to fix whatever variation of the problem we are seeing.

justinfx avatar Aug 30 '24 20:08 justinfx

Did you flip the runtime guard to enable this feature? It won't work unless envoy.reloadable_features.allow_multiplexed_upstream_half_close is set to true.

alyssawilk avatar Sep 09 '24 15:09 alyssawilk

@alyssawilk I didn't know it was behind a feature flag! I'm going to apply that!

justinfx avatar Sep 09 '24 19:09 justinfx

I believe I am still observing this issue with the patched version (but would appreciate it if someone could validate my method):

  • I launched an instance of istio-testing/proxyv2:1.24-alpha.ceeeab0e900812087dae52d851584a73d5939290-distroless (a version of the Istio proxy containing the patched version of Envoy).
  • I added a static_layer with "envoy.reloadable_features.allow_multiplexed_upstream_half_close": true
  • I am still observing (13 INTERNAL: Received RST_STREAM with code 0 (Call ended without gRPC status))

DerekTBrown avatar Sep 11 '24 16:09 DerekTBrown

Any chance you could provide a repro case and/or logs? There are definitely many reasons things can go south and while I'm unsure if there are other failure modes which would cause this particular error, it'd definitely be helpful to have the additional data to better assess.

alyssawilk avatar Sep 11 '24 17:09 alyssawilk

I too would love to see a repro. We still don't know how Envoy manages to get into this state in production. But so far we have been running ok since Sunday with a combination of:

  • dev envoy with this runtime feature fix enabled
  • remove overrides for http2_protocol_options settings
  • using a max_connection_duration of 300s for upstream cluster

A repro would be cool since we don't know if we have fixed it or are just lucky so far.

It does sound similar to envoy issue #8114

justinfx avatar Sep 11 '24 19:09 justinfx

@justinfx @alyssawilk I spent some time digging into this. It looks like there was more than one reason why we were seeing missing HEADERS in our environment. Enabling the runtime feature fix seems to have reduced the error rate, but I am also seeing grpc-go send RST_STREAM without trailers:

https://github.com/grpc/grpc-go/issues/7623

DerekTBrown avatar Sep 11 '24 21:09 DerekTBrown

Glad this has reduced your error rate. I don't think the information in the attached issue is enough for us to repro. Any chance again we could get more detailed info and/or logs? I'm happy to supply you with email addresses if there's info you can share but don't want posted on the public internet

alyssawilk avatar Sep 16 '24 13:09 alyssawilk