akka-http

Finished streams lingering in `HalfClosedRemoteSendingData` state leading to `REFUSED_STREAM`

Open fredfp opened this issue 2 years ago • 9 comments

We might be hitting an issue in akka-http (10.2.7 and 10.2.9) when using akka-grpc: on the server side, some requests seem to never fully finish and all new requests are being refused (REFUSED_STREAM because activeStreamCount() > settings.maxConcurrentStreams).

All we have is a heap dump from a time when the issue was likely occurring. Analysing the heap dump, we notice that all streams in Http2StreamHandling#streamStates are in state HalfClosedRemoteSendingData, and their respective outStream seems to indicate that everything is finished (buffer is empty, upstreamClosed is true, endStreamSent is true, trailer is set, maybeinlet is null).

Looking at the code, I would expect those streams to have transitioned to the Closed state and been removed from the streamStates map. They have not, and I don't understand why.

We didn't manage to reproduce in isolation, and I doubt we can afford to run in debug mode for days until the issue pops up again. What would you advise we do to understand the issue better?
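To illustrate why lingering entries would end in REFUSED_STREAM, here is a minimal sketch of the admission check described above. It is a simplified model, not the akka-http implementation: `Connection`, `open` and `finish` are hypothetical names, and the assumption that a new stream is counted before the check (which is what makes the `>` comparison work) is mine.

```scala
// Simplified model of the symptom, NOT the actual akka-http code.
object RefusedStreamSketch {
  sealed trait StreamState
  case object HalfClosedRemoteSendingData extends StreamState

  final class Connection(maxConcurrentStreams: Int) {
    // Stands in for Http2StreamHandling#streamStates.
    private var streamStates = Map.empty[Int, StreamState]

    def activeStreamCount: Int = streamStates.size

    /** Returns true if the stream is accepted, false if it would be answered with REFUSED_STREAM. */
    def open(streamId: Int): Boolean = {
      // Assumption: the new stream is registered first, then the limit is checked.
      streamStates += (streamId -> HalfClosedRemoteSendingData)
      if (activeStreamCount > maxConcurrentStreams) {
        streamStates -= streamId // refused streams are not tracked
        false
      } else true
    }

    /** The transition to Closed: the entry is removed from the map. */
    def finish(streamId: Int): Unit = streamStates -= streamId
  }
}
```

If `finish` is never reached for completed streams, `activeStreamCount` only grows, and once it reaches the limit every later `open` is refused, which matches the behaviour we observe.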

fredfp avatar Mar 23 '22 15:03 fredfp

Is it possible to go through the following states on the server side: Idle -> OpenReceivingDataFirst -> Open -> HalfClosedRemoteSendingData? If yes, can it happen that we land in HalfClosedRemoteSendingData with outStream finishing at the same time but the proper transition to Closed doesn't happen?
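To make the question concrete, the path I have in mind can be sketched as a small state machine. The state names are the ones from the heap dump and the code; the event names (`RequestHeadersReceived`, `ResponseDispatched`, `RemoteEndStream`, `OutStreamFinished`) are hypothetical labels of mine, and the real akka-http state machine has more states and transitions than shown here.

```scala
// Simplified model of the path in question, NOT the actual akka-http code.
object StreamStatesSketch {
  sealed trait StreamState
  case object Idle extends StreamState
  case object OpenReceivingDataFirst extends StreamState
  case object Open extends StreamState
  case object HalfClosedRemoteSendingData extends StreamState
  case object Closed extends StreamState

  // Hypothetical event names, chosen for illustration only.
  sealed trait Event
  case object RequestHeadersReceived extends Event // request HEADERS without END_STREAM
  case object ResponseDispatched extends Event     // the response out stream is created
  case object RemoteEndStream extends Event        // the peer sent END_STREAM
  case object OutStreamFinished extends Event      // all response frames, incl. trailer, were sent

  def transition(state: StreamState, event: Event): StreamState = (state, event) match {
    case (Idle, RequestHeadersReceived)                   => OpenReceivingDataFirst
    case (OpenReceivingDataFirst, ResponseDispatched)     => Open
    case (Open, RemoteEndStream)                          => HalfClosedRemoteSendingData
    case (HalfClosedRemoteSendingData, OutStreamFinished) => Closed
    case (other, _)                                       => other // ignored in this sketch
  }

  def run(events: List[Event]): StreamState =
    events.foldLeft(Idle: StreamState)(transition)
}
```

In this model a stream stays in HalfClosedRemoteSendingData forever if the final `OutStreamFinished` step is lost, for example when the out stream finishes at the moment the state is entered.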

fredfp avatar Mar 24 '22 09:03 fredfp

Thanks for the great report, @fredfp. I'll have a look.

jrudolph avatar Mar 24 '22 09:03 jrudolph

My pleasure, thanks for having a look! As mentioned on gitter, I can't share the heap dump, but I'd be happy to do some screensharing or further data extraction if that could help. Just let me know.

fredfp avatar Mar 24 '22 09:03 fredfp

What kind of gRPC requests are involved in that case? Are they streaming or request/response style?

jrudolph avatar Mar 24 '22 10:03 jrudolph

This kind of call stack could be a problem: the state machine is entered twice on the same call stack, so the inner state change is lost.

[image: call stack screenshot]
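The general hazard can be shown with a toy state machine. This is a minimal sketch of the pattern only, not the akka-http code: `Machine` and `update` are hypothetical names, and the assumption is that the new state is computed by a handler and stored after the handler returns.

```scala
// Toy illustration of a lost state change through re-entrancy, NOT the akka-http code.
object ReentrancySketch {
  sealed trait State
  case object Sending extends State
  case object Closed extends State

  final class Machine {
    var state: State = Sending

    // The next state is stored only after the handler returns, so a state
    // change made by a nested update call inside the handler is overwritten.
    def update(handler: State => State): Unit = {
      val next = handler(state)
      state = next
    }
  }

  def run(): State = {
    val m = new Machine
    m.update { current =>
      m.update(_ => Closed) // inner call: the stream is closed here
      current               // outer handler returns the state it started with
    }
    m.state // Sending: the inner transition to Closed was lost
  }
}
```

After `run()` the machine is still in `Sending` even though the inner call moved it to `Closed`, which is the kind of lingering state seen in the heap dump.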

jrudolph avatar Mar 24 '22 10:03 jrudolph

We mostly have request/response style calls, and a single request with a streaming response. From the heap dump I assumed the issue was with a request/response style call, given that maybeinlet had not been set (and is never cleared in the code).

fredfp avatar Mar 24 '22 10:03 fredfp

Thanks!

The previous call stack would not be a problem, as a gRPC response usually has multiple frames, so pullNextFrame would not even want to close the stream immediately.

Next conjecture to test: unexpected timing of incoming WINDOW_UPDATE frames using this call stack:

[image: call stack screenshot]

jrudolph avatar Mar 24 '22 10:03 jrudolph

@fredfp do you know how big your response data is? I can reproduce the issue when the response is bigger than the INITIAL_WINDOW_SIZE. What's your client and does it have any extra config on the WINDOW size settings?
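For context on why the response size matters, here is a rough sketch of HTTP/2 send-side flow control. It is a simplified model, not akka-http code: `Sender` and its methods are hypothetical names, and 65,535 bytes is the protocol default for the initial window size.

```scala
// Simplified model of HTTP/2 send-side flow control, NOT the akka-http code.
object FlowControlSketch {
  // Protocol default for SETTINGS_INITIAL_WINDOW_SIZE (RFC 7540).
  val DefaultInitialWindowSize: Int = 65535

  final case class Sender(window: Int, remaining: Int) {
    /** Sends as many bytes as the window allows; returns (bytes sent, new sender state). */
    def sendAvailable: (Int, Sender) = {
      val n = math.min(window, remaining)
      (n, copy(window = window - n, remaining = remaining - n))
    }

    /** Applies an incoming WINDOW_UPDATE frame. */
    def windowUpdate(increment: Int): Sender = copy(window = window + increment)

    def isBlocked: Boolean = remaining > 0 && window == 0
    def isDone: Boolean = remaining == 0
  }
}
```

A 100,000 byte response sends 65,535 bytes, then is blocked with 34,465 bytes left until a WINDOW_UPDATE arrives; a response smaller than the window finishes without ever waiting. Only the first case makes the completion of the stream depend on the timing of incoming WINDOW_UPDATE frames.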

jrudolph avatar Mar 24 '22 11:03 jrudolph

Our client is akka-grpc, going through a couple of linkerd proxies. We don't touch any settings related to window size. Our responses are of varying sizes, and I'm pretty sure they can be bigger than 64 KB (which IIRC is the default initial window size).

fredfp avatar Mar 24 '22 11:03 fredfp