gRPC connection closing with unexpected EOF on high-latency connections

Open • brianneville opened this issue 2 years ago • 38 comments

What version of gRPC are you using?

1.46.2

What version of Go are you using (go version)?

go version go1.18 darwin/amd64

What operating system (Linux, Windows, …) and version?

Client is running macOS (ProductName: macOS, ProductVersion: 11.6, BuildVersion: 20G165)

Server is running Linux (CentOS 7)

What did you do?

Note: I have posted a small repo that can be used to reproduce this bug, please check it out here: https://github.com/brianneville/grpcbug


For high-latency connections, I noticed that RPCs were having their connections terminated with an internal error: code = Internal desc = unexpected EOF

This hits on both unary and streaming RPCs, and the ease with which it can be reproduced seems to vary with the client-server latency. For high-latency connections (such as a ~150ms ping), the error hits pretty much every time an RPC is called. For low-latency connections (<2ms ping), the error is much more infrequent, and hundreds of thousands of messages may be streamed over the connection before it is hit even once.

I did a bit of digging through the grpc library code, and found two ways that the error can be prevented:

1. Client side - configure window sizes

The error can be prevented from the client side by disabling the dynamic window and BDP estimation for flow control when dialing the grpc server. That is, setting the DialOptions on the client side to use:

    opts := []grpc.DialOption{
       grpc.WithInitialWindowSize(largerWindowSize),
       grpc.WithInitialConnWindowSize(largerWindowSize),
    }

Where both:

  • largerWindowSize is greater than 65535 (so that dynamic window/flow-control estimation is turned off), and
  • largerWindowSize is greater than the size of the largest RPC response message (with a bit of overhead, for some reason); see the sketch below.
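
For reference, here is a minimal, self-contained sketch of that client-side workaround. The 16MB value for largerWindowSize is only an example that satisfies both conditions above (it matches the value suggested later in this thread), and the target address and insecure credentials are assumptions for illustration:

    package main

    import (
        "log"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
    )

    func main() {
        // Any value greater than 65535 disables BDP-based dynamic window estimation.
        const largerWindowSize int32 = (1 << 20) * 16 // 16MB, example only

        opts := []grpc.DialOption{
            grpc.WithTransportCredentials(insecure.NewCredentials()), // assumption: plaintext connection
            grpc.WithInitialWindowSize(largerWindowSize),             // per-stream window
            grpc.WithInitialConnWindowSize(largerWindowSize),         // per-connection window
        }

        conn, err := grpc.Dial("server-host:50051", opts...) // hypothetical address
        if err != nil {
            log.Fatalf("dial: %v", err)
        }
        defer conn.Close()
        // ... create service stubs and issue RPCs on conn as usual ...
    }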

2. Server-side - delay ending the stream

The error can be prevented by delaying the calls that write the OK status into the transport. Specifically, if the header carrying END_STREAM is delayed from being put into the controlBuf at google.golang.org/grpc/internal/transport/http2_server.go#finishStream, the error does not occur. That is, you can make either of the following changes to *http2Server.finishStream and the error will not be present:

    time.Sleep(1 * time.Second) // <-- sleeping before put = EOF error is prevented
    t.controlBuf.put(hdr)

or

    go func() {         // <-- allowing finishStream to progress and sleep before put = EOF error is prevented
      time.Sleep(1 * time.Second)
      t.controlBuf.put(hdr)
    }()

When the t.controlBuf.put(hdr) line is delayed in this way, the RPC is allowed to complete normally, and the client will see the response as intended.

Note that if the sleep is added after the t.controlBuf.put(hdr), the error is still present (i.e. delaying the finishStream function itself is not what prevents the error):

    t.controlBuf.put(hdr)
    time.Sleep(1 * time.Second) // <-- sleeping after put = EOF error still occurs 

Would anyone know what might be going on here, or be able to give me some insight/advice for continuing to debug this issue?

What did you expect to see?

Connection latency does not affect RPCs

What did you see instead?

High-latency connections (such as ~150ms ping) reliably hit the error code = Internal desc = unexpected EOF

brianneville avatar May 16 '22 20:05 brianneville

Hello, thank you for bringing up this issue/potential bug. In its current state, the team is unable to reproduce this, since it requires pinging a remote peer running CentOS 7 over a high-latency connection. Could you please try reproducing this on the same machine, or even better, in the same binary? grpc-go provides a latency package (https://pkg.go.dev/google.golang.org/grpc/benchmark/latency) that can inject real-world latency into network connections. Reproducing it this way would (a) let the team reproduce it ourselves, and (b) get rid of all the other variables that may have caused this bug, narrowing the system under question down to our gRPC code and the TCP stack it runs on top of.
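
For illustration, wrapping the server's listener with that package looks roughly like the following. This is only a sketch (not taken from any existing test); the Network values, the listen address, and the service registration are assumptions:

    import (
        "log"
        "net"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/benchmark/latency"
    )

    func newSlowServer() (net.Listener, *grpc.Server, error) {
        lis, err := net.Listen("tcp", "localhost:0")
        if err != nil {
            return nil, nil, err
        }

        // Simulate a slow link with ~150ms one-way latency; the zero values
        // leave bandwidth (Kbps) and MTU effectively unrestricted.
        slowNetwork := latency.Network{
            Kbps:    0,
            Latency: 150 * time.Millisecond,
            MTU:     0,
        }
        slowLis := slowNetwork.Listener(lis) // accepted connections are artificially delayed

        srv := grpc.NewServer()
        // Register the service from the repro repo here, e.g.
        // pb.RegisterHelloServer(srv, &helloServer{}) // hypothetical names
        go func() {
            if err := srv.Serve(slowLis); err != nil {
                log.Printf("serve: %v", err)
            }
        }()
        return slowLis, srv, nil
    }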

zasweq avatar May 25 '22 22:05 zasweq

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

github-actions[bot] avatar Jun 01 '22 01:06 github-actions[bot]

Hi,

I wrote a test using the latency package as you described, but unfortunately I can't yet reproduce the bug with it. The test uses the same RPC that I have been using to trigger the bug (https://github.com/brianneville/grpcbug), and is available (at the time of this comment) here: https://github.com/brianneville/grpc-go/commit/e1ce0e5f2b19a4680ee1b4c6735316bc6bcf15f8

I also recorded some debug logs (https://github.com/brianneville/grpcbug/tree/master/logs) showing the case where the bug hits, in contrast to the case where it does not. It's interesting that the bug only seems to hit when the server is on the remote (CentOS) machine - hosting the server on my local (Mac) machine and attempting the RPC does not produce this EOF error.

Something that may be relevant here is that the remote machine is very powerful relative to the local machine. I wonder whether this would allow the remote machine to experience some unanticipated interleaving of goroutines (reading/writing) that could contribute to the error along with the high ping.

brianneville avatar Jun 06 '22 15:06 brianneville

My worry is that there's something else in the system impacting it like a firewall or proxy. It might be helpful to inspect the traffic with something like wireshark.

dfawley avatar Jun 06 '22 16:06 dfawley

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

github-actions[bot] avatar Jun 12 '22 18:06 github-actions[bot]

Hi,

The CentOS remote machine is behind a VPN, although it's worth noting that this issue can be hit even when the client is on the same VPN (it's just a lot less frequent, presumably because the network connection is better).

Other points of note regarding the architecture(s) required to reproduce this issue: the issue is reproducible when running the https://github.com/brianneville/grpcbug server on a Debian machine (tested with a Debian container on the CentOS remote). It is also reproducible when the client is either a Mac or an Ubuntu machine.

brianneville avatar Jun 18 '22 20:06 brianneville

I believe we're also seeing this issue, but there is no high latency, as both the client and server are running in the same process. It's rare (1 out of every 3000-5000 attempts), but we've seen it on both macOS and Linux. Setting a large window size does seem to resolve the issue.

In our case, it's just a Go test that spins up a server and client and makes a simple (non-streaming) RPC.

zmb3 avatar Jun 21 '22 14:06 zmb3

@zmb3 TestListResources appears to be failing due to a different problem - although it is weird that the window-size workaround for this bug can also be applied to it, so it's still worth mentioning here.

I wrote a small unit test that minimally reproduces the TestListResources error seen in gravitational/teleport#13548 and added info on the bug in the commit message: https://github.com/brianneville/grpcbug/commit/2b04ba8946af96e6e511174fd0753f5832b35fc4 (briefly, the client is slow to clean up resources after receiving responses which exceed its grpc.max_receive_message_length, and so tells the server to terminate the stream when it cannot allocate resources to a follow-up response)
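
For anyone following along, the triggering configuration looks roughly like this (a sketch only; the address, receive cap, and values are hypothetical, not taken from the actual test):

    // A per-call receive cap small enough that some responses exceed it; such
    // responses fail with ResourceExhausted, and in the buggy case the follow-up
    // RPC on the same connection is where the unexpected EOF was observed.
    conn, err := grpc.Dial(addr, // addr is a placeholder
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithDefaultCallOptions(
            grpc.MaxCallRecvMsgSize(4*1024*1024), // 4MB cap (hypothetical)
        ),
    )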

I haven't yet looked into the TestGRPCErrorWrapping failure mentioned at https://github.com/gravitational/teleport/issues/9492#issuecomment-1159151779

brianneville avatar Jun 22 '22 13:06 brianneville

Thank you @brianneville! Appreciate you putting together a smaller reproduction - it was on my list of things to do.

zmb3 avatar Jun 22 '22 14:06 zmb3

Thanks for providing a smaller test case that reproduces the error seen in TestListResources. I was able to reproduce that error locally and am looking into it further now; it's interesting that it's also fixed with the window-size workaround. However, the orthogonal issue that this issue was originally filed about is still too difficult for the team to reproduce (given the requirements on the machine the server is running on).

zasweq avatar Jun 29 '22 23:06 zasweq

Thank you so much for getting this issue root-caused and fixed in the end! Really truly appreciate it!

brianneville avatar Jul 13 '22 15:07 brianneville

Yeah, thank you so much for finding this issue! Huge correctness issue within our repository, glad it got caught/fixed :). Weird that it didn't come up sooner.

zasweq avatar Jul 13 '22 15:07 zasweq

Hi,

Really sorry to resurrect this thread again, but the issue still seems to persist.
That is to say, following the original instructions to reproduce this error (see https://github.com/brianneville/grpcbug#readme) still leads to an EOF, even when using a version of grpc-go that includes https://github.com/grpc/grpc-go/pull/5494.

This #5494 did, however, fix the issue reproduced by the test https://github.com/brianneville/grpcbug/blob/master/related/teleport_test.go, but it looks like that issue just coincidentally shared an identical symptom and workaround.

Can you give me any advice on how to proceed with debugging this? You mention that it can't be reproduced on your hardware/setup, but is there anything that you would like me to try on my setup which might help to better understand what could be going on here?

brianneville avatar Aug 02 '22 15:08 brianneville

Hey, sorry to see that this still persists. My biggest recommendation is to add some logging around the http2_client and try to figure out what is triggering the EOF, similar to how we worked out that new streams weren't getting the correct flow-control numbers when root-causing the failure in the teleport tests.

zasweq avatar Aug 02 '22 18:08 zasweq

@zasweq curious if we think it could be related to this bug in net/http2[1]? Could you help drive priority? This is causing huge problems for us in production with golang 1.18.

[1] https://github.com/golang/go/issues/53253 <--updated

hexfusion avatar Aug 09 '22 10:08 hexfusion

cc @dfawley please reopen?

hexfusion avatar Aug 09 '22 10:08 hexfusion

> Really sorry to resurrect this thread again, but the issue still seems to persist. That is to say, following the original instructions to reproduce this error (see https://github.com/brianneville/grpcbug#readme) still leads to an EOF, even when using a version of grpc-go that includes #5494.
>
> This #5494 did, however, fix the issue reproduced by the test https://github.com/brianneville/grpcbug/blob/master/related/teleport_test.go, but it looks like that issue just coincidentally shared an identical symptom and workaround.
>
> Can you give me any advice on how to proceed with debugging this? You mention that it can't be reproduced on your hardware/setup, but is there anything that you would like me to try on my setup which might help to better understand what could be going on here?

@brianneville were you using go 1.18? if yes we are also trying to see if this could be a possible regression go 1.17 -> 1.18

hexfusion avatar Aug 09 '22 13:08 hexfusion

> @zasweq curious if we think it could be related to this bug in net/http2[1]? Could you help drive priority? This is causing huge problems for us in production with golang 1.18.

Unlikely, since grpc-go uses its own HTTP server implementation, and just the framer part of the net/http2 library. ~~But your link to the net/http2 issue is a link back to this issue, so I can't see what you are referencing.~~

EDIT: Oh sorry, I also see your back-reference from that bug to this one. Still unlikely for the same reason, but I'll take a look.

dfawley avatar Aug 09 '22 15:08 dfawley

Anecdotal evidence only, but I don't recall seeing these failures prior to our Go 1.18 upgrade.

zmb3 avatar Aug 09 '22 15:08 zmb3

https://github.com/golang/go/issues/53253 appears completely unrelated: (1) we don't use the same client/server implementation, and (2) the symptoms are different; that was a 5s delay caused by a delay in receipt of a frame with END_STREAM set, whereas this appears to be streams failing due to flow-control violations (either the sender using more quota than it had, or the recipient incorrectly believing that was the case).

dfawley avatar Aug 09 '22 16:08 dfawley

To add a little more context: we are seeing this error with grpc-go 1.48.0 and Go 1.18:

    rpc error: code = Internal desc = unexpected EOF

client

grpc.WithDefaultCallOptions(
  grpc.MaxCallRecvMsgSize(math.MaxInt),
  grpc.MaxCallSendMsgSize(math.MaxInt),
)

server

[]grpc.ServerOption{
  grpc.MaxRecvMsgSize(math.MaxInt),
  grpc.MaxSendMsgSize(math.MaxInt),
}

RPC duration is > 150ms but the connection is local (tcp).

For the sake of eliminating all variables, we are going to test rolling back Go to 1.17.

hexfusion avatar Aug 09 '22 17:08 hexfusion

I compiled and ran the server and client for the repo that reproduces this bug (https://github.com/brianneville/grpcbug) using go1.17.6 and the issue is still present, so I'd be hesitant to say that https://github.com/golang/go/issues/53253 is at fault here.

Just to explore further though, @hexfusion, could you try appending the following DialOptions to the client?

    var opts []grpc.DialOption

    // add any other DialOptions...

    const maxWindowSize int32 = (1 << 20) * 16
    opts = append(opts,
        grpc.WithInitialWindowSize(maxWindowSize),
        grpc.WithInitialConnWindowSize(maxWindowSize),
    )

These DialOptions are a workaround for this bug (https://github.com/grpc/grpc-go/issues/5358) so it would be interesting to see if they have any effect in your case.

brianneville avatar Aug 09 '22 18:08 brianneville

Do you think it would help even in the case where we are setting the max message sizes to math.MaxInt, as per the above?

hexfusion avatar Aug 09 '22 18:08 hexfusion

The grpc.WithInitialWindowSize and grpc.WithInitialConnWindowSize options are independent of the grpc.MaxCallRecvMsgSize and grpc.MaxCallSendMsgSize options. You can combine them if you wish (I'd say try it, because I'm curious to see whether the workaround has any effect in your case).

Setting grpc.WithInitialWindowSize and grpc.WithInitialConnWindowSize disables BDP estimation and manually forces the window size to whatever value you provide. By comparison, grpc.MaxCallRecvMsgSize and grpc.MaxCallSendMsgSize only set the maximum size of the overall message that can be received/sent (with that message being sent across multiple data frames, using a dynamic window size unless BDP estimation is disabled).
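
To make the distinction concrete, here is a sketch of a client that sets both kinds of options (the values are assumptions, reusing the 16MB window from the workaround above):

    import (
        "log"
        "math"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
    )

    func dialWithBothLimits(addr string) *grpc.ClientConn {
        // Fixed flow-control windows; any value > 65535 disables BDP estimation.
        const maxWindowSize int32 = (1 << 20) * 16 // 16MB

        conn, err := grpc.Dial(addr,
            grpc.WithTransportCredentials(insecure.NewCredentials()), // assumption: plaintext connection
            // Transport-level flow control.
            grpc.WithInitialWindowSize(maxWindowSize),
            grpc.WithInitialConnWindowSize(maxWindowSize),
            // Per-call limits on total message size, independent of flow control.
            grpc.WithDefaultCallOptions(
                grpc.MaxCallRecvMsgSize(math.MaxInt),
                grpc.MaxCallSendMsgSize(math.MaxInt),
            ),
        )
        if err != nil {
            log.Fatalf("dial: %v", err)
        }
        return conn
    }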

brianneville avatar Aug 09 '22 19:08 brianneville

One more bit of context: we see this right after a GOAWAY from MaxAge, in a heavy-traffic application. Generally speaking, we are looking at about ~35000 TPS, and the issue reproduces over a 48-hour window. The logs right before:

2022/08/06 01:54:20 INFO: [transport] transport: error closing conn during Close: close tcp 127.0.0.1:44067->127.0.0.1:51234: use of closed network connection
2022/08/06 01:54:20 WARNING: [transport] transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:44067->127.0.0.1:51234: use of closed network connection
2022/08/06 01:54:20 ERROR: [transport] transport: loopyWriter.run returning. Err: transport: Connection closing
2022/08/06 01:54:20 INFO: [core] [Channel #4] Channel Connectivity change to CONNECTING
2022/08/06 01:54:20 INFO: [core] pickfirstBalancer: UpdateSubConnState: 0xc00053f330, {CONNECTING <nil>}
2022/08/06 01:54:20 INFO: [core] [Channel #4 SubChannel #5] Subchannel picks a new address "127.0.0.1:44067" to connect
2022/08/06 01:54:20 INFO: [core] [Channel #4 SubChannel #5] Subchannel Connectivity change to CONNECTING
2022/08/06 01:54:20 INFO: [transport] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2022/08/06 01:54:20 INFO: [core] [Channel #4] Channel Connectivity change to IDLE
2022/08/06 01:54:20 INFO: [core] pickfirstBalancer: UpdateSubConnState: 0xc00053f330, {IDLE <nil>}
2022/08/06 01:54:20 INFO: [core] [Channel #4 SubChannel #5] Subchannel Connectivity change to IDLE

hexfusion avatar Aug 09 '22 19:08 hexfusion

If the connection is lost in the middle of an RPC, then RPC failures are expected.

You could use a MaxAgeGrace setting to help work around that issue when it's caused by MaxAge connection termination from the server.
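
For reference, a minimal sketch of where those two knobs live on the server side (the durations are placeholders, not recommendations):

    import (
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/keepalive"
    )

    // MaxConnectionAge makes the server recycle connections with a GOAWAY once they
    // reach the given age; MaxConnectionAgeGrace is how long in-flight RPCs are given
    // to finish before the connection is forcibly closed.
    func newServer() *grpc.Server {
        return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
            MaxConnectionAge:      30 * time.Minute, // placeholder value
            MaxConnectionAgeGrace: 5 * time.Minute,  // placeholder value
        }))
    }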

dfawley avatar Aug 09 '22 20:08 dfawley

Thanks for the reply. We currently have MaxAgeGrace set to math.MaxInt64 to circumvent forceful closure of the stream. The result of those forced closures was observed as:

rpc error: code = Unavailable desc = closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR

After we bumped MaxAgeGrace, we now see the reported rpc error: code = Internal desc = unexpected EOF.

hexfusion avatar Aug 09 '22 21:08 hexfusion

How long after the initial GOAWAY are you seeing those EOF errors? Are you still pretty sure it's related? Does it still only reproduce every ~48 hours?

dfawley avatar Aug 16 '22 17:08 dfawley

The unexpected EOF is right after the GOAWAY, which made me feel like it was possibly an issue with transparent retry, but I have not proved it. It ranges between 48 and 72 hours but has been consistent. Also, FTR, it was reproduced on Go 1.17, which at this point is not that surprising, but noting it in case folks see this later.

hexfusion avatar Aug 16 '22 19:08 hexfusion

I will create a new issue for this, but the whole reason we are using ServerMaxConnectionAge is that we would overflow the stream ID every now and then (invalid stream ID), so maybe it's related.

all unary requests FWIW.

hexfusion avatar Aug 16 '22 19:08 hexfusion