grpc-go
gRPC connection closing with unexpected EOF on high-latency connections
What version of gRPC are you using?
1.46.2
What version of Go are you using (go version)?
go version go1.18 darwin/amd64
What operating system (Linux, Windows, …) and version?
Client is running macOS 11.6 (BuildVersion 20G165)
Server is running Linux (CentOS 7)
What did you do?
Note: I have posted a small repo that can be used to reproduce this bug, please check it out here: https://github.com/brianneville/grpcbug
For high-latency connections, I noticed that RPCs were having their connection terminated with an internal error:
code = Internal desc = unexpected EOF
This hits on both unary and streaming RPCs, and the ease with which this can be reproduced seems to vary with the client-server latency. For high-latency connections (such as a 150ms ping), this error hits pretty much every time an RPC is called. For low-latency connections (<2ms ping) this error is much more infrequent, and hundreds of thousands of messages may be streamed over the connection before the error is hit even once.
I did a bit of digging through the grpc library code, and found two ways that the error can be prevented:
1. Client side - configure window sizes
The error can be prevented from the client side by disabling the dynamic window and BDP estimation for flow control when dialing the grpc server. That is, setting the DialOptions on the client side to use:
opts := []grpc.DialOption{
	grpc.WithInitialWindowSize(largerWindowSize),
	grpc.WithInitialConnWindowSize(largerWindowSize),
}
Where largerWindowSize:
- is greater than 65535 (so that dynamic window/flow estimation is turned off), and
- is greater than the size of the largest RPC response message (with a bit of overhead, for some reason).
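For illustration, a concrete sketch that satisfies both constraints might look like the following (the 16 MiB value is purely an example, chosen on the assumption that no single response exceeds it):
// Must be > 65535 and larger than the biggest expected response message;
// 16 MiB is illustrative, not a value from the original report.
const largerWindowSize int32 = (1 << 20) * 16

opts := []grpc.DialOption{
	grpc.WithInitialWindowSize(largerWindowSize),
	grpc.WithInitialConnWindowSize(largerWindowSize),
}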
2. Server-side - delay ending the stream
The error can be prevented by delaying the calls that write the StatusOk into the transport. Specifically, the error is avoided if the END_STREAM header is delayed from being put into the controlBuf at google.golang.org/grpc/internal/transport/http2_server.go#finishStream.
That is, you can make either of the following changes to *http2Server.finishStream, and the error will not be present:
time.Sleep(1 * time.Second) // <-- sleeping before put = EOF error is prevented
t.controlBuf.put(hdr)
or
go func() { // <-- allowing finishStream to progress and sleep before put = EOF error is prevented
	time.Sleep(1 * time.Second)
	t.controlBuf.put(hdr)
}()
When the t.controlBuf.put(hdr) line is delayed in this way, the RPC is allowed to complete normally, and the client will see the response as intended.
Note that if you add the sleep after the t.controlBuf.put(hdr), the error will still be present (i.e. delaying the finishStream function itself is not what prevents the error):
t.controlBuf.put(hdr)
time.Sleep(1 * time.Second) // <-- sleeping after put = EOF error still occurs
Would anyone know what might be going on here, or be able to give me some insight/advice for continuing to debug this issue?
What did you expect to see?
Connection latency does not affect RPCs
What did you see instead?
High-latency connections (such as ~150ms ping) reliably hit the error code = Internal desc = unexpected EOF
Hello, thank you for bringing up this issue/potential bug. In its current state, the team is unable to reproduce this, since it requires pinging a remote peer running CentOS 7 over a high-latency link. Could you please try reproducing this on the same machine, or even better, in the same binary? grpc-go provides a latency package (https://pkg.go.dev/google.golang.org/grpc/benchmark/latency) that provides a way to inject real-world latency into network connections. Reproducing in this way would (a) let the team reproduce it ourselves, and (b) get rid of all the other variables that may have caused this bug, narrowing the system under question down to our gRPC code and the TCP stack it runs on top of.
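For example, a minimal sketch of a single-process setup using that latency package might look like the following (the latency/bandwidth numbers are arbitrary assumptions, and registration of the grpcbug service is elided; this is one way such a test could be wired, not the team's own code):
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/benchmark/latency"
)

func main() {
	lis, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}

	// Simulate a slow, high-latency link on top of the local connection.
	slow := &latency.Network{
		Kbps:    10 * 1024,              // example bandwidth
		Latency: 150 * time.Millisecond, // example latency
		MTU:     1500,
	}

	s := grpc.NewServer()
	// Register the grpcbug service here, then point the client at lis.Addr().
	if err := s.Serve(slow.Listener(lis)); err != nil {
		log.Fatalf("failed to serve: %v", err)
	}
}
The client then dials lis.Addr() as usual, and every connection accepted through the wrapped listener gets the simulated network characteristics.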
This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.
Hi,
I wrote a test using the latency package as you described, but unfortunately I can't yet seem to reproduce the bug with it. I wrote the test using the same RPC that I have been using to trigger the bug (https://github.com/brianneville/grpcbug). This test (at the time of this comment) is available here: https://github.com/brianneville/grpc-go/commit/e1ce0e5f2b19a4680ee1b4c6735316bc6bcf15f8
I also recorded some debug logs showing the case where the bug hits in contrast to the case where it does not: https://github.com/brianneville/grpcbug/tree/master/logs. It's interesting that the bug only seems to hit when the server is on the remote (CentOS) machine - hosting the server on my local (Mac) machine and attempting the RPC does not produce this EOF error.
Something that may be relevant here is that the remote machine is very powerful relative to the local machine. I wonder whether this would allow the remote machine to experience some unanticipated interleaving of goroutines (reading/writing) that could contribute to the error along with the high ping.
My worry is that there's something else in the system impacting it, like a firewall or proxy. It might be helpful to inspect the traffic with something like Wireshark.
This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.
Hi,
The Centos remote machine is behind a VPN, although it's worth noting that this issue can be hit even when the client is on the same VPN (it's just a lot less frequent, presumably because the network connection is better)
Other points of note regarding the architecture(s) required to reproduce this issue: The issue is reproducible when running the https://github.com/brianneville/grpcbug server on a Debian machine (tested with a Debian container on the Centos remote). It is also reproducible when the client is either a Mac or Ubuntu machine.
I believe we're also seeing this issue, but there is no high latency, as both the client and server are running in the same process. It's rare (1 out of every 3000-5000 attempts) but we've seen it on both macOS and Linux. Setting a large window size does seem to resolve the issue.
In our case, it's just a Go test that spins up a server and client and makes a simple (non-streaming) RPC.
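For context, a minimal sketch of a test with that shape, assuming a loopback TCP listener and placeholder service names (the actual test will differ):
package example_test

import (
	"net"
	"testing"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func TestUnaryRPCSameProcess(t *testing.T) {
	// Server and client run in the same process over loopback TCP.
	lis, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		t.Fatal(err)
	}
	s := grpc.NewServer()
	// Register the service under test here, e.g. pb.RegisterFooServer(s, &fooServer{}).
	go s.Serve(lis)
	defer s.Stop()

	conn, err := grpc.Dial(lis.Addr().String(),
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		t.Fatal(err)
	}
	defer conn.Close()

	// Make one unary RPC with the generated client; the rare failure surfaces
	// here as codes.Internal with "unexpected EOF".
	// _, err = pb.NewFooClient(conn).DoSomething(context.Background(), &pb.Request{})
}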
@zmb3 TestListResources appears to be failing due to a different problem - although it is weird that the window-size workaround for this bug can also be applied to it, so it's still worth mentioning here.
I wrote a small unit test that minimally reproduces the TestListResources error seen in gravitational/teleport#13548 and added info on the bug in the commit message: https://github.com/brianneville/grpcbug/commit/2b04ba8946af96e6e511174fd0753f5832b35fc4 (briefly, the client is slow to clean up resources after receiving responses which exceed its grpc.max_receive_message_length, and so tells the server to terminate the stream when it cannot allocate resources for a follow-up response).
I haven't yet looked into the TestGRPCErrorWrapping failure mentioned at https://github.com/gravitational/teleport/issues/9492#issuecomment-1159151779.
Thank you @brianneville! Appreciate you putting together a smaller reproduction - it was on my list of things to do.
Thanks for providing a smaller test case that reproduces the error seen in TestListResources. I was able to reproduce this error locally and am looking into it further now; it's interesting that it's also fixed with the window-size workaround. However, the orthogonal issue this thread was originally filed about is still too difficult for the team to reproduce (given the requirements on the machine the server is running on).
Thank you so much for getting this issue root-caused and fixed in the end! Really truly appreciate it!
Yeah, thank you so much for finding this issue! Huge correctness issue within our repository, glad it got caught/fixed :). Weird that it didn't come up sooner.
Hi,
Really sorry to resurrect this thread again but the issue still seems to persist.
That is to say, following the original instructions to reproduce this error (seen in https://github.com/brianneville/grpcbug#readme), still leads to an EOF, even when using the version of grpc-go which has https://github.com/grpc/grpc-go/pull/5494 in it.
#5494 did, however, fix the issue reproduced with the test https://github.com/brianneville/grpcbug/blob/master/related/teleport_test.go, but it looks like that issue just coincidentally shared an identical symptom and workaround.
Can you give me any advice on how to proceed with debugging this? You mention that it can't be reproduced on your hardware/setup, but is there anything that you would like me to try on my setup which might help to better understand what could be going on here?
Hey, sorry to see that this still persists. My biggest recommendation is to throw some logs around the http2_client and try to figure out what is triggering the EOF, similar to how we derived that new streams weren't getting the correct flow-control numbers when root-causing the teleport test failure.
@zasweq curious if we think it could be related to this bug in net/http2[1]? Could you help drive priority? This is causing huge problems for us in production with golang 1.18.
[1] https://github.com/golang/go/issues/53253 <--updated
cc @dfawley please reopen?
@brianneville were you using Go 1.18? If so, we are also trying to see if this could be a possible regression from Go 1.17 -> 1.18.
@zasweq curious if we think it could be related to this bug in net/http2[1]? Could you help drive priority? This is causing huge problems for us in production with golang 1.18.
Unlikely, since grpc-go uses its own HTTP server implementation, and just the framer part of the net/http2 library. ~~But your link to the net/http2 issue is a link back to this issue, so I can't see what you are referencing.~~
EDIT: Oh sorry, I also see your back-reference from that bug to this one. Still unlikely for the same reason, but I'll take a look.
Anecdotal evidence only, but I don't recall seeing these failures prior to our Go 1.18 upgrade.
https://github.com/golang/go/issues/53253 appears completely unrelated: 1. We don't use the same client/server implementation, and 2. the symptoms are different; that was a 5s delay caused by a delay in receipt of a frame with END_STREAM set -- this appears to be streams failing due to flow control violations (either the sender using more quota than it had or the recipient incorrectly believing that was the case).
To add a little more context: we are seeing this error with grpc-go 1.48.0 and Go 1.18:
rpc error: code = Internal desc = unexpected EOF
client:
grpc.WithDefaultCallOptions(
	grpc.MaxCallRecvMsgSize(math.MaxInt),
	grpc.MaxCallSendMsgSize(math.MaxInt),
)
server:
[]grpc.ServerOption{
	grpc.MaxRecvMsgSize(math.MaxInt),
	grpc.MaxSendMsgSize(math.MaxInt),
}
RPC duration is > 150ms but the connection is local (tcp).
For the sake of eliminating all variables, we are going to test rolling back Go to 1.17.
I compiled and ran the server and client from the repo that reproduces this bug (https://github.com/brianneville/grpcbug) using go1.17.6 and the issue is still present, so I'd be hesitant to say that https://github.com/golang/go/issues/53253 is at fault here.
Just to explore further though, @hexfusion could you try appending the following DialOptions to the client?
var opts []grpc.DialOption
// add any other DialOptions...
const maxWindowSize int32 = (1 << 20) * 16
opts = append(opts,
	grpc.WithInitialWindowSize(maxWindowSize),
	grpc.WithInitialConnWindowSize(maxWindowSize),
)
These DialOptions are a workaround for this bug (https://github.com/grpc/grpc-go/issues/5358) so it would be interesting to see if they have any effect in your case.
Do you think that it would help even in the case where we are setting the max message sizes to math.MaxInt as per above?
The grpc.WithInitialWindowSize and grpc.WithInitialConnWindowSize options are independent of the grpc.MaxCallRecvMsgSize and grpc.MaxCallSendMsgSize options. You can combine them if you wish (I'd say try it, because I'm curious to see whether the workaround has any effect in your case).
Setting grpc.WithInitialWindowSize and grpc.WithInitialConnWindowSize will disable BDP estimation and force the window size to whatever value you provide.
By comparison, grpc.MaxCallRecvMsgSize and grpc.MaxCallSendMsgSize only set the maximum size of the overall message that can be received/sent (with this message being sent in multiple data frames using a dynamic window size unless BDP is disabled).
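To make the distinction concrete, here is a rough sketch of a client dial that combines both kinds of options (the insecure credentials and the 16 MiB window value are assumptions for illustration, not taken from the original setup):
package main

import (
	"math"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func dial(target string) (*grpc.ClientConn, error) {
	const maxWindowSize int32 = (1 << 20) * 16

	return grpc.Dial(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// Fixed flow-control window for the connection and each stream (disables BDP).
		grpc.WithInitialWindowSize(maxWindowSize),
		grpc.WithInitialConnWindowSize(maxWindowSize),
		// Per-RPC limits on the total size of a received/sent message.
		grpc.WithDefaultCallOptions(
			grpc.MaxCallRecvMsgSize(math.MaxInt),
			grpc.MaxCallSendMsgSize(math.MaxInt),
		),
	)
}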
One more bit of context: we see this right after a GOAWAY from MaxAge, in a heavy-traffic application. Generally speaking we are looking at about ~35000 TPS, and the issue reproduces over a 48-hour window. The logs right before:
2022/08/06 01:54:20 INFO: [transport] transport: error closing conn during Close: close tcp 127.0.0.1:44067->127.0.0.1:51234: use of closed network connection
2022/08/06 01:54:20 WARNING: [transport] transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:44067->127.0.0.1:51234: use of closed network connection
2022/08/06 01:54:20 ERROR: [transport] transport: loopyWriter.run returning. Err: transport: Connection closing
2022/08/06 01:54:20 INFO: [core] [Channel #4] Channel Connectivity change to CONNECTING
2022/08/06 01:54:20 INFO: [core] pickfirstBalancer: UpdateSubConnState: 0xc00053f330, {CONNECTING <nil>}
2022/08/06 01:54:20 INFO: [core] [Channel #4 SubChannel #5] Subchannel picks a new address "127.0.0.1:44067" to connect
2022/08/06 01:54:20 INFO: [core] [Channel #4 SubChannel #5] Subchannel Connectivity change to CONNECTING
2022/08/06 01:54:20 INFO: [transport] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2022/08/06 01:54:20 INFO: [core] [Channel #4] Channel Connectivity change to IDLE
2022/08/06 01:54:20 INFO: [core] pickfirstBalancer: UpdateSubConnState: 0xc00053f330, {IDLE <nil>}
2022/08/06 01:54:20 INFO: [core] [Channel #4 SubChannel #5] Subchannel Connectivity change to IDLE
If the connection is lost in the middle of an RPC then it's expected to get RPC failures.
You could use a MaxAgeGrace setting to help work around that issue when it's caused by MaxAge connection termination from the server.
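For reference, assuming MaxAge/MaxAgeGrace here refer to the server keepalive parameters MaxConnectionAge and MaxConnectionAgeGrace, a rough sketch of setting them looks like this (the durations are placeholders, not a recommendation):
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func newServer() *grpc.Server {
	return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		// Close connections after roughly this age (the server sends a GOAWAY first).
		MaxConnectionAge: 30 * time.Minute,
		// Give in-flight RPCs this long to finish before the connection is force-closed.
		MaxConnectionAgeGrace: 5 * time.Minute,
	}))
}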
Thanks for the reply. We currently have MaxAgeGrace set to math.MaxInt64 to circumvent forceful closure of the stream. The result of those forced closures was observed as:
rpc error: code = Unavailable desc = closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR
After we bumped MaxAgeGrace, we now see the reported rpc error: code = Internal desc = unexpected EOF.
How long after the initial GOAWAY are you seeing those EOF errors? Are you still pretty sure it's related? Does it still only reproduce every ~48 hours?
The unexpected EOF is right after the GOAWAY, which made me feel like it was a possible issue with transparent retry, but I have not proved it. It ranges between 48 and 72 hours but has been consistent. Also, FTR, it was reproduced on Go 1.17, which at this point is not that surprising, but noting it just in case folks see this later.
I will create a new issue for this, but the whole reason we are using ServerMaxConnectionAge is that we would overflow the stream ID every now and then (invalid stream ID). So maybe it's related.
All unary requests, FWIW.