grpc-go
grpc-go copied to clipboard
TCP Connection reset after a single lost keepalive packet
What version of gRPC are you using?
1.54.0
What version of Go are you using (go version)?
go1.20.2 linux/amd64
What operating system (Linux, Windows, …) and version?
Linux 6.2
Bug Report
The configuration of TCP_USER_TIMEOUT to 20 seconds in #2307 and #5219 together with go default tcp keepalive interval of just 15 seconds results in a tcp connection being reset after a single lost keepalive packet.
Lost packets of a tcp connection are normally being re-transmitted after a short amount of time, well within the 20 seconds timeout. But tcp keepalive packets are not being re-transmitted (ACK segments that contain no data are not reliably transmitted by TCP). Therefore the timeout is reached after just a single lost packet.
Normally not re-transmitting tcp keepalive packets is fine as the connection is only reseted after TCP_KEEPCNT(default=9) lost keepalive packets.
Test
TCPDump of a test grpc connection (disabled keepalives on the client to reduce packet count, the same issue can be reproduced with default keepalives on both the server and client):

- After packet 199 I dropped all traffic TO port 8090
sudo iptables -I INPUT -p tcp --dport 8090 -j DROP
- Packet 200 gets received by the client and answered in packet 201
- Packet 201 is only visible in the tcpdump but does not get received by the server
- Packet 202 resets the connection by the server when the next keepalive would have been send
Increasing the TCP_USER_TIMEOUT to 50 seconds results in the connection only being reset after 3 lost keepalive packets.
grpc.KeepaliveParams(
keepalive.ServerParameters{
Timeout: 50 * time.Second,
},
),
Note: Here "keepalive" refers to the grpc/http2 keepalive mechanism and timeout configuration, not the tcp keepalives.
Our keepalive implementation is based on proposals A8 and A9. Any changes here would need to made across grpc implementations. If you want to propose any changes to this spec, you can do on the grpc-proposal repository.
Lost TCP keepalive packets might not be re-transmitted. But gRPC keepalives use a HTTP/2 PING frame. Why are these getting lost and not retransmitted?
Another thing to note is that gRPC sets the TCP keepalive timeout only when gRPC keepalive Time is not configured. See the code here: https://github.com/grpc/grpc-go/blob/master/internal/transport/http2_client.go#L264. This is the keepalive configuration struct on the client side: https://pkg.go.dev/google.golang.org/[email protected]/keepalive#ClientParameters.
Can you share you gRPC keepalive configuration on the client and server side? Thanks.
Other grpc implementations are most likely not affected by the same issue.
The Design Background of A8 states that tcp keepalives are disabled by default, but when enabled after an implementation-specific duration of inactivity (on the order of 1-2 hours, but sysadmin configurable).
This is probably based on RFC1122 This interval MUST be configurable and MUST default to no less than two hours.
With this high interval, tcp keepalives are likely never been send out because other keepalive mechanisms in grpc/http2 send data more regularly. These keealives are normal tcp packets and are been re-transmitted when a single packet drops.
But TCP connections on GO default to a very short keepalive interval of just 15 seconds. Therefore this issue is only present in the go grpc implementation.
I am using the default keepalive settings on both the server and client side. Thats why i think that this issue should at least be documented, or better "fixed" by setting more forgiving default settings.
Thanks for the explanation @Lucaber
Do you see the problem even in the case where you set both keepalive.ClientParameters.Time and keepalive.ClientParameters.Timeout on the client and keepalive.ServerParameters.Time on the server to non-default values?
And since you are seeing the TCP connection getting closed after a single lost keepalive frame, are you saying that TCP_KEEPCNT=9 is not taking effect?
Correct, TCP_KEEPCNT=9 is not taking effect.
This depends on the values configured. The client and server configuration can mostly be viewed independently as the same issue is happening in both directions.
Setting the Timeout value controls how many keepalive packets can be dropped. Setting this value to 50 seconds, allows 2 lost packets. The connection gets reset after the third dropped packet.
Setting the Time value
- to
time.Duration(math.MaxInt64)disables the configuration ofTCP_USER_TIMEOUTand solves the problem. Without aTCP_USER_TIMEOUT, the connecting now times-out only after 9(TCP_KEEPCNT)*15 seconds - to a value higher than 15 seconds still enables the TCP keepalive mechanism and the issue persists.
- to a value lower than 15 seconds effectively disables tcp keepalives, because the connection never goes idle. (At least when a grpc call is running and the connection is being used). But the active connection results in an
ENHANCE_YOUR_CALMerror after a few grpc-keepalives. Im not sure where i can configure this behavior.
Another way to fix this problem is disabling (or setting to 2 hours) the tcp keepalive interval. This can only be done be the user because grpc-go does not initialize the tcp socket:
var lc net.ListenConfig
lc.KeepAlive = -1
listener, err := lc.Listen(context.Background(), "tcp", "0.0.0.0:8090")
grpc.WithContextDialer(func(ctx context.Context, s string) (net.Conn, error) {
var d net.Dialer
d.KeepAlive = -1
return d.Dial("tcp", s)
}),
But the active connection results in an ENHANCE_YOUR_CALM error after a few grpc-keepalives. Im not sure where i can configure this behavior.
This can be controlled by setting the MinTime field of server enforcement policy defined here: https://pkg.go.dev/google.golang.org/[email protected]/keepalive#EnforcementPolicy. But you shouldn't have to set the value to such a low one to get around this problem.
If the client sets keepalive.ClientParameters.Time to whatever value (except infinity) and set keepalive.ClientParameters.Timeout to a value greater than 9*15s, that would ensure correct keepalive behavior, wouldn't it?
Apart from documenting this, do you have any other suggestions for handling this issue? Thanks.
Sorry, I just noticed that the default Time of the client side is defaulted to infinity, meaning the TCP_USER_TIMEOUT is actually only set on the server side by default. So the described issue only happens with lost keepalives from the server side.
Otherwise correct, setting the keepalive.ClientParameters.Timeout to a value grater than 9*15s enforces correct keepalive behavior even when keepalive.ClientParameters.Time is set to a different value.
Applying this to the server side keepalive.ServerParameters.Timeout fixes the problem.
Im not sure how to handle this problem correctly. The best way to solve this issue according to the spec is setting the correct TCP keepalive interval. But this can only be set by the user explicitly and is not controlled by the library.
Another way would be to increase the Timeout for example to the mentioned 50 seconds. This change would introduce an additional implementation difference from the grpc spec. But allows 2 dropped packets.
After investigating the issue a bit more, i found out that it is actually possible to change the tcp keepalive settings after the connection is already established. Meaning we are able to disable tcp keepalives when grpc/http2 keepalives are enabled. I have now prepared a commit that fixes the issue: https://github.com/Lucaber/grpc-go/commit/d35dd281857feffafdf03efc1a1bdf2724eb0991 I could create a PR if desired.
A tcpdump with default settings now looks like this.
Keepalives from the server side are disabled, but
TCP_USER_TIMEOUT is still configured without side effects.
Keepalives from the client side are enabled (as seen in the screenshot), but TCP_USER_TIMEOUT is not being set. The connection only resets after 9 lost packets.
Any updates on this issue? The mentioned commit seams to be the best option to handle this problem without requiring an action by the user
We have several people out of the office for an extended amount of time, so this is unlikely to get any attention for a few weeks, sorry.
Sorry, haven't been able to get back to this. Will definitely pick this up this week.
Go TCP support does the following by default when creating a TCP connection:
- Enables TCP keepalives using the
SO_KEEPALIVEsocket option - Sets both the keepalive idle period (
TCP_KEEPIDLE) and keepalive interval (TCP_KEEPINTVL) to the default value of15s(or the value specified by the user)- This means that the TCP layer will start sending out TCP keepalives after
15sof inactivity, and - Will continue to send TCP keepalives every
15sas long as there is no activity - And with a default value of
TCP_KEEPCNT=9, if no gRPC keepalives are configured, the connection would be terminated after150sof inactivity (the first15sof inactivity is required to send the first keepalive). See: https://github.com/golang/go/blob/598958f0f247fa24b8ed4dfcd454a1958f212666/src/net/tcpsock.go#L238
- This means that the TCP layer will start sending out TCP keepalives after
Now, if gRPC keepalives are configured, the TCP_USER_TIMEOUT is set to the value of Timeout specified in the keepalive configuration, or a default of 20s. And as @Lucaber correctly pointed out, there is a difference between the client and the server sides.
- On the server, a default
keepalive.Timeof2his used even if the user does not specify one, and therefore theTCP_USER_TIMEOUTgets set to20s. - On the client side,
TCP_USER_TIMEOUTgets set to20sor the value ofkeepalive.Timeoutonly ifkeepalive.Timeis specified by the user.
IIUC correctly, if the keepalive.Time value is set to a reasonably high value (let's say 2h), and keepalive.Timeout is not specified, gRPC keepalives will not kick in before 2h of inactivity and TCP keepalives will kick in after 15s of inactivity. And according to the man page:
TCP_USER_TIMEOUT:
...
Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
TCP_USER_TIMEOUT will override keepalive to determine when
to close a connection due to keepalive failure.
...
And because gRPC has set TCP_USER_TIMEOUT to 20s, and the Go stdlib has set SO_KEEPALIVE, the connection is being shutdown after 20s of sending the first TCP keepalive probe.
@ejona86 : Do you have any thoughts on how to handle this? @Lucaber has a commit where they are disabling TCP keepalives if gRPC keepalive is configured.
Sets both the keepalive idle period and keepalive interval to the default value of 15s (or the value specified by the user)
TCP keepalive? 15s?! That's really aggressive. That's quite different than I remember when I looked at Go before. I think that is likely a Go regression. Seems it was caused by https://github.com/golang/go/commit/fbf763fd1d6be3c162ea5ff3c8843171ef937c3a ?
Go's configuration for keepalive is partly counter-productive[1] and we were purposefully not changing it. We should have just been setting "use TCP keepalive" but using the OS defaults. If we can't use OS defaults any more, then Go has broken us.
- Go's configuration changes both TCP keepalive time and interval, and (understandably, because Windows, doesn't let you configure retry count). It would have been more useful if it only changed the time. Let's assume the default is "time = 1h, interval = 60s, retries=11." On broken connection, it will take 1h+10*60s = 1h10m to kill the connection. If you configure Go to use 5 minutes (the default server-permitted lower bound allowed for HTTP/2 keepalive), it'll take 11*5m = 55m for TCP to kill the connection. Very little gain for such an aggressive change.
If we can't use OS defaults any more, then Go has broken us.
Looks like a yes: https://github.com/golang/go/issues/48622
We should have just been setting "use TCP keepalive" but using the OS defaults.
Are you saying, we should have just used TCP keepalives instead of implementing gRPC level keepalives? Or are you saying, we should have set TCP keepalive options instead of setting TCP_USER_TIMEOUT?
Also, what are your thoughts about disabling TCP keepalives when gRPC keepalives are enabled? This is what I'm taking about.
I'm saying:
- Implement HTTP/2 level keepalives as defined by the gRFC
- Set TCP_USER_TIMEOUT if it is available, when (1) is enabled
- Enable TCP keepalives, unconditionally
(2) and (3) aren't documented in any gRFC I believe. (2) was done as part of the ALTS work, and it does yield better behavior for all folks. (3) is something that Java's been doing forever, and still do independent of (1). Even with (1), (3) is still useful as some people configure their OS to use more aggressive settings for their specific environment (e.g., AWS). For those environments (3) behaves well without any extra work from users. And if you aren't in such an environment, then (3) is very low or zero cost.
(1) We currently do implement HTTP/2 level keepalives as described in A8 and A9.
(2) We currently also set TCP_USER_TIMEOUT when available and when (1) is enabled. Also this is documented in A18.
(3) The TCP implementation in Go currently enables TCP keepalives by default unless explicitly specified by the user to disable it (either via a Dialer config or via a Listener config).
The slight nuance here is that on the server side, we end up setting TCP_USER_TIMEOUT even if (1) is not enabled. This is because the default value for keepalive.Time on the server is 2h, and therefore unless the user explicitly sets keepalive.Time to infinity on the server, the check to set TCP_USER_TIMEOUT ends up passing, and we end up setting it to the default value of 20s, unless keepalive.Timeout is specified by the user.
- Do you think this behavior is a bug?
So, even after (1) (2) and (3), if a user enables gRPC keepalives on the client and sets keepalive.Time to a reasonable value like 1h or 2h and doesn't set keepalive.Timeout:
- the first TCP keepalive probe will be sent out after
15sof inactivity - and if this keepalive probe is not ACKed, the connection will be terminated after
20s.
Would it make sense to at least set the TCP keepalive time to whatever is configured using keepalive.Time so that TCP keepalives don't kick in after 15s of inactivity?
Looks like in Java we only set TCP_USER_TIMEOUT on client-side. I make no argument whether that is a good idea or not.
the first TCP keepalive probe will be sent out after 15s of inactivity
No. This is the broken Go behavior. Bad things will happen. We don't want that. We file a bug and get them to fix it. And then we can consider workarounds for the current version once they respond.
Proposal https://github.com/golang/go/issues/62254 has been created to enable setting of TCP keepalive time and interval separately. Doesn't affect us directly since we were not setting TCP keepalive socket options explicitly.
We do have the option of disabling the setting of these TCP keepalive values by Go to the default of 15s by passing a negative value for Dialer.KeepAlive on the client and ListenConfig.KeepAlive on the server.
-
On the client, we create the raw connection here: https://github.com/grpc/grpc-go/blob/4c9777ceff549b54cf37acc64504a10f786c58e2/internal/transport/http2_client.go#L179 We currently pass a
net.Dialer{}with no customizations. It should be easy to set theKeepAlivefield of this dialer to a negative value -
On the server though, we accept a
net.Listenerin Serve from the user. The only thing we can do here is to document the current behavior clearly, and ask users who don't want the Go defaults of15sand instead want the OS defaults, to create anet.Listenerusing https://pkg.go.dev/net#ListenConfig.Listen withListenConfig.KeepAliveset to a negative value.- We could even update one of our examples to illustrate this.
What do you feel about this approach? @Lucaber @dfawley @ejona86
We could also reset the tcp keepalive settings to the os defaults. My commit only disabled them, but reading the defaults should work too. (At least on Linux)
I could work on a poc if this is a valid option
We could also reset the tcp keepalive settings to the os defaults. My commit only disabled them, but reading the defaults should work too. (At least on Linux)
I prefer the options of using the Keepalive field in net.Dialer, over directly calling tcpConn.SetKeepAlive(false) so that we don't disable TCP keepalives if it is enabled by default in the system. But the former option does leave the responsibility to the user for the server-side. But given that Go has set the default value of TCP keepalive time and interval to 15s for about 5 years, and if people have been ok with it so far, then they probably aren't too worried about it anyway.
I could work on a poc if this is a valid option
Let's wait to see what @dfawley and @ejona86 feel about this option. Thanks.
We need the following changes:
- On the client side:
- The creation of the underlying TCP connection happens here: https://github.com/grpc/grpc-go/blob/1e0d82e9f00e430f9d6c903b814d1aea0a6566e8/internal/transport/http2_client.go#L152
- On the last line of this function, we call
DialContexton an emptynet.Dialer. - Instead we would have to use a
net.Dialer, with theKeepalivefield set to a negative value. - This would disable the
15soverride of the TCP keepalive time and interval by Go, and retain OS defaults. - So, something like
return (&net.Dialer{Keepalive: time.Duration(-1)}).DialContext(ctx, networkType, address)
- On the last line of this function, we call
- The creation of the underlying TCP connection happens here: https://github.com/grpc/grpc-go/blob/1e0d82e9f00e430f9d6c903b814d1aea0a6566e8/internal/transport/http2_client.go#L152
- We also need to document the custom dialer option, that Go overrides the OS defaults for TCP keepalive time and interval and that it is the user's responsibility to retain the OS defaults, if they care about it.
- On the server side:
- Since, we accept a
net.Listenerfrom the user, we cannot set theKeepaliveduration in the ListenConfig. It is only possible to get anet.Listenerfrom anet.ListenConfigand not the other way around. - We need to document grpc.Server.Serve that Go overrides the OS defaults for TCP keepalive interval and time, and if the user cares about it, they could set
net.ListenConfig.Keepaliveto a negative value and use the Listen method onnet.ListenConfigto create anet.Listenerthat they can then pass togrpc.Server.Serve. - We should also update the helloworld example server to demonstrate how to do this: https://github.com/grpc/grpc-go/blob/1e0d82e9f00e430f9d6c903b814d1aea0a6566e8/examples/helloworld/greeter_server/main.go#L50
- Since, we accept a
@Lucaber Would you be interested in sending a PR for this? Thanks.
Hi @easwars @ginayeh, happy to pick up this issue
Sounds great - let us know if you need anything!
@dfawley @Lucaber may I check how do I replicate the original issue?
grpc version: v1.58.2 go version: go1.20.4 darwin/arm64 OS: macOS Ventura 13.4
at packet 3773, I applied the rule to pftcl:
block drop in on lo0 proto tcp from any to any port = 8080
but my server does not reset the connection after a single lost keepalive packet (I am using the default keepalive settings on both client and server side).
Please kindly advise on how I can replicate the issue. Also do forgive me as I am not super experienced with computer networking but I'll pick up the pieces along the way! Thank you!
@JaydenTeoh -- Since the root cause for the issue was already done, I guess trying to replicate the original issue is not a blocker for the fix.
It would be nice if @Lucaber or @JaydenTeoh are able to reproduce the original issue and then confirm that the fix in #6672 truly fixes the issue. Maybe @Lucaber can confirm?
Also it appears the fix is incomplete. We need to make sure we're setting the keepalive socket option as appropriate (see https://github.com/grpc/grpc-go/pull/6672#discussion_r1386248342).
@JaydenTeoh -- Did you want to patch the fix for https://github.com/grpc/grpc-go/pull/6672#discussion_r1386248342?
This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.
@arvindbr8 Sorry on the delay. I'll be happy to work on the fix! Do give me slightly more than a week because I am currently caught up in finals if that is okay!