grpc-go icon indicating copy to clipboard operation
grpc-go copied to clipboard

TCP Connection reset after a single lost keepalive packet

Open Lucaber opened this issue 2 years ago • 9 comments

What version of gRPC are you using?

1.54.0

What version of Go are you using (go version)?

go1.20.2 linux/amd64

What operating system (Linux, Windows, …) and version?

Linux 6.2

Bug Report

The configuration of TCP_USER_TIMEOUT to 20 seconds in #2307 and #5219 together with go default tcp keepalive interval of just 15 seconds results in a tcp connection being reset after a single lost keepalive packet.

Lost packets of a tcp connection are normally being re-transmitted after a short amount of time, well within the 20 seconds timeout. But tcp keepalive packets are not being re-transmitted (ACK segments that contain no data are not reliably transmitted by TCP). Therefore the timeout is reached after just a single lost packet.

Normally not re-transmitting tcp keepalive packets is fine as the connection is only reseted after TCP_KEEPCNT(default=9) lost keepalive packets.

Test

TCPDump of a test grpc connection (disabled keepalives on the client to reduce packet count, the same issue can be reproduced with default keepalives on both the server and client): image

  • After packet 199 I dropped all traffic TO port 8090
sudo iptables -I INPUT -p tcp --dport 8090 -j DROP
  • Packet 200 gets received by the client and answered in packet 201
  • Packet 201 is only visible in the tcpdump but does not get received by the server
  • Packet 202 resets the connection by the server when the next keepalive would have been send

Increasing the TCP_USER_TIMEOUT to 50 seconds results in the connection only being reset after 3 lost keepalive packets.

		grpc.KeepaliveParams(
			keepalive.ServerParameters{
				Timeout: 50 * time.Second,
			},
		),

Note: Here "keepalive" refers to the grpc/http2 keepalive mechanism and timeout configuration, not the tcp keepalives.

Lucaber avatar May 03 '23 13:05 Lucaber

Our keepalive implementation is based on proposals A8 and A9. Any changes here would need to made across grpc implementations. If you want to propose any changes to this spec, you can do on the grpc-proposal repository.

Lost TCP keepalive packets might not be re-transmitted. But gRPC keepalives use a HTTP/2 PING frame. Why are these getting lost and not retransmitted?

Another thing to note is that gRPC sets the TCP keepalive timeout only when gRPC keepalive Time is not configured. See the code here: https://github.com/grpc/grpc-go/blob/master/internal/transport/http2_client.go#L264. This is the keepalive configuration struct on the client side: https://pkg.go.dev/google.golang.org/[email protected]/keepalive#ClientParameters.

Can you share you gRPC keepalive configuration on the client and server side? Thanks.

easwars avatar May 09 '23 18:05 easwars

Other grpc implementations are most likely not affected by the same issue.

The Design Background of A8 states that tcp keepalives are disabled by default, but when enabled after an implementation-specific duration of inactivity (on the order of 1-2 hours, but sysadmin configurable). This is probably based on RFC1122 This interval MUST be configurable and MUST default to no less than two hours.

With this high interval, tcp keepalives are likely never been send out because other keepalive mechanisms in grpc/http2 send data more regularly. These keealives are normal tcp packets and are been re-transmitted when a single packet drops.

But TCP connections on GO default to a very short keepalive interval of just 15 seconds. Therefore this issue is only present in the go grpc implementation.

I am using the default keepalive settings on both the server and client side. Thats why i think that this issue should at least be documented, or better "fixed" by setting more forgiving default settings.

Lucaber avatar May 10 '23 17:05 Lucaber

Thanks for the explanation @Lucaber

Do you see the problem even in the case where you set both keepalive.ClientParameters.Time and keepalive.ClientParameters.Timeout on the client and keepalive.ServerParameters.Time on the server to non-default values?

And since you are seeing the TCP connection getting closed after a single lost keepalive frame, are you saying that TCP_KEEPCNT=9 is not taking effect?

easwars avatar May 10 '23 18:05 easwars

Correct, TCP_KEEPCNT=9 is not taking effect.

This depends on the values configured. The client and server configuration can mostly be viewed independently as the same issue is happening in both directions.

Setting the Timeout value controls how many keepalive packets can be dropped. Setting this value to 50 seconds, allows 2 lost packets. The connection gets reset after the third dropped packet.

Setting the Time value

  • to time.Duration(math.MaxInt64) disables the configuration of TCP_USER_TIMEOUT and solves the problem. Without a TCP_USER_TIMEOUT, the connecting now times-out only after 9(TCP_KEEPCNT)*15 seconds
  • to a value higher than 15 seconds still enables the TCP keepalive mechanism and the issue persists.
  • to a value lower than 15 seconds effectively disables tcp keepalives, because the connection never goes idle. (At least when a grpc call is running and the connection is being used). But the active connection results in an ENHANCE_YOUR_CALM error after a few grpc-keepalives. Im not sure where i can configure this behavior.

Another way to fix this problem is disabling (or setting to 2 hours) the tcp keepalive interval. This can only be done be the user because grpc-go does not initialize the tcp socket:

var lc net.ListenConfig
lc.KeepAlive = -1
listener, err := lc.Listen(context.Background(), "tcp", "0.0.0.0:8090")
grpc.WithContextDialer(func(ctx context.Context, s string) (net.Conn, error) {
	var d net.Dialer
	d.KeepAlive = -1
	return d.Dial("tcp", s)
}),

Lucaber avatar May 10 '23 20:05 Lucaber

But the active connection results in an ENHANCE_YOUR_CALM error after a few grpc-keepalives. Im not sure where i can configure this behavior.

This can be controlled by setting the MinTime field of server enforcement policy defined here: https://pkg.go.dev/google.golang.org/[email protected]/keepalive#EnforcementPolicy. But you shouldn't have to set the value to such a low one to get around this problem.

If the client sets keepalive.ClientParameters.Time to whatever value (except infinity) and set keepalive.ClientParameters.Timeout to a value greater than 9*15s, that would ensure correct keepalive behavior, wouldn't it?

Apart from documenting this, do you have any other suggestions for handling this issue? Thanks.

easwars avatar May 11 '23 00:05 easwars

Sorry, I just noticed that the default Time of the client side is defaulted to infinity, meaning the TCP_USER_TIMEOUT is actually only set on the server side by default. So the described issue only happens with lost keepalives from the server side.

Otherwise correct, setting the keepalive.ClientParameters.Timeout to a value grater than 9*15s enforces correct keepalive behavior even when keepalive.ClientParameters.Time is set to a different value. Applying this to the server side keepalive.ServerParameters.Timeout fixes the problem.

Im not sure how to handle this problem correctly. The best way to solve this issue according to the spec is setting the correct TCP keepalive interval. But this can only be set by the user explicitly and is not controlled by the library.

Another way would be to increase the Timeout for example to the mentioned 50 seconds. This change would introduce an additional implementation difference from the grpc spec. But allows 2 dropped packets.


After investigating the issue a bit more, i found out that it is actually possible to change the tcp keepalive settings after the connection is already established. Meaning we are able to disable tcp keepalives when grpc/http2 keepalives are enabled. I have now prepared a commit that fixes the issue: https://github.com/Lucaber/grpc-go/commit/d35dd281857feffafdf03efc1a1bdf2724eb0991 I could create a PR if desired.

Lucaber avatar May 11 '23 12:05 Lucaber

image A tcpdump with default settings now looks like this. Keepalives from the server side are disabled, but TCP_USER_TIMEOUT is still configured without side effects. Keepalives from the client side are enabled (as seen in the screenshot), but TCP_USER_TIMEOUT is not being set. The connection only resets after 9 lost packets.

Lucaber avatar May 11 '23 12:05 Lucaber

Any updates on this issue? The mentioned commit seams to be the best option to handle this problem without requiring an action by the user

Lucaber avatar May 24 '23 12:05 Lucaber

We have several people out of the office for an extended amount of time, so this is unlikely to get any attention for a few weeks, sorry.

dfawley avatar May 30 '23 22:05 dfawley

Sorry, haven't been able to get back to this. Will definitely pick this up this week.

easwars avatar Jul 11 '23 17:07 easwars

Go TCP support does the following by default when creating a TCP connection:

  • Enables TCP keepalives using the SO_KEEPALIVE socket option
  • Sets both the keepalive idle period (TCP_KEEPIDLE) and keepalive interval (TCP_KEEPINTVL) to the default value of 15s (or the value specified by the user)
    • This means that the TCP layer will start sending out TCP keepalives after 15s of inactivity, and
    • Will continue to send TCP keepalives every 15s as long as there is no activity
    • And with a default value of TCP_KEEPCNT=9, if no gRPC keepalives are configured, the connection would be terminated after 150s of inactivity (the first 15s of inactivity is required to send the first keepalive). See: https://github.com/golang/go/blob/598958f0f247fa24b8ed4dfcd454a1958f212666/src/net/tcpsock.go#L238

Now, if gRPC keepalives are configured, the TCP_USER_TIMEOUT is set to the value of Timeout specified in the keepalive configuration, or a default of 20s. And as @Lucaber correctly pointed out, there is a difference between the client and the server sides.

  • On the server, a default keepalive.Time of 2h is used even if the user does not specify one, and therefore the TCP_USER_TIMEOUT gets set to 20s.
  • On the client side, TCP_USER_TIMEOUT gets set to 20s or the value of keepalive.Timeout only if keepalive.Time is specified by the user.

IIUC correctly, if the keepalive.Time value is set to a reasonably high value (let's say 2h), and keepalive.Timeout is not specified, gRPC keepalives will not kick in before 2h of inactivity and TCP keepalives will kick in after 15s of inactivity. And according to the man page:

TCP_USER_TIMEOUT:
...

Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
TCP_USER_TIMEOUT will override keepalive to determine when
to close a connection due to keepalive failure.

...

And because gRPC has set TCP_USER_TIMEOUT to 20s, and the Go stdlib has set SO_KEEPALIVE, the connection is being shutdown after 20s of sending the first TCP keepalive probe.

@ejona86 : Do you have any thoughts on how to handle this? @Lucaber has a commit where they are disabling TCP keepalives if gRPC keepalive is configured.

easwars avatar Jul 27 '23 01:07 easwars

Sets both the keepalive idle period and keepalive interval to the default value of 15s (or the value specified by the user)

TCP keepalive? 15s?! That's really aggressive. That's quite different than I remember when I looked at Go before. I think that is likely a Go regression. Seems it was caused by https://github.com/golang/go/commit/fbf763fd1d6be3c162ea5ff3c8843171ef937c3a ?

Go's configuration for keepalive is partly counter-productive[1] and we were purposefully not changing it. We should have just been setting "use TCP keepalive" but using the OS defaults. If we can't use OS defaults any more, then Go has broken us.

  1. Go's configuration changes both TCP keepalive time and interval, and (understandably, because Windows, doesn't let you configure retry count). It would have been more useful if it only changed the time. Let's assume the default is "time = 1h, interval = 60s, retries=11." On broken connection, it will take 1h+10*60s = 1h10m to kill the connection. If you configure Go to use 5 minutes (the default server-permitted lower bound allowed for HTTP/2 keepalive), it'll take 11*5m = 55m for TCP to kill the connection. Very little gain for such an aggressive change.

ejona86 avatar Jul 27 '23 17:07 ejona86

If we can't use OS defaults any more, then Go has broken us.

Looks like a yes: https://github.com/golang/go/issues/48622

easwars avatar Jul 27 '23 18:07 easwars

We should have just been setting "use TCP keepalive" but using the OS defaults.

Are you saying, we should have just used TCP keepalives instead of implementing gRPC level keepalives? Or are you saying, we should have set TCP keepalive options instead of setting TCP_USER_TIMEOUT?

Also, what are your thoughts about disabling TCP keepalives when gRPC keepalives are enabled? This is what I'm taking about.

easwars avatar Aug 01 '23 20:08 easwars

I'm saying:

  1. Implement HTTP/2 level keepalives as defined by the gRFC
  2. Set TCP_USER_TIMEOUT if it is available, when (1) is enabled
  3. Enable TCP keepalives, unconditionally

(2) and (3) aren't documented in any gRFC I believe. (2) was done as part of the ALTS work, and it does yield better behavior for all folks. (3) is something that Java's been doing forever, and still do independent of (1). Even with (1), (3) is still useful as some people configure their OS to use more aggressive settings for their specific environment (e.g., AWS). For those environments (3) behaves well without any extra work from users. And if you aren't in such an environment, then (3) is very low or zero cost.

ejona86 avatar Aug 01 '23 20:08 ejona86

(1) We currently do implement HTTP/2 level keepalives as described in A8 and A9.

(2) We currently also set TCP_USER_TIMEOUT when available and when (1) is enabled. Also this is documented in A18.

(3) The TCP implementation in Go currently enables TCP keepalives by default unless explicitly specified by the user to disable it (either via a Dialer config or via a Listener config).

The slight nuance here is that on the server side, we end up setting TCP_USER_TIMEOUT even if (1) is not enabled. This is because the default value for keepalive.Time on the server is 2h, and therefore unless the user explicitly sets keepalive.Time to infinity on the server, the check to set TCP_USER_TIMEOUT ends up passing, and we end up setting it to the default value of 20s, unless keepalive.Timeout is specified by the user.

  • Do you think this behavior is a bug?

So, even after (1) (2) and (3), if a user enables gRPC keepalives on the client and sets keepalive.Time to a reasonable value like 1h or 2h and doesn't set keepalive.Timeout:

  • the first TCP keepalive probe will be sent out after 15s of inactivity
  • and if this keepalive probe is not ACKed, the connection will be terminated after 20s.

Would it make sense to at least set the TCP keepalive time to whatever is configured using keepalive.Time so that TCP keepalives don't kick in after 15s of inactivity?

easwars avatar Aug 01 '23 21:08 easwars

Looks like in Java we only set TCP_USER_TIMEOUT on client-side. I make no argument whether that is a good idea or not.

the first TCP keepalive probe will be sent out after 15s of inactivity

No. This is the broken Go behavior. Bad things will happen. We don't want that. We file a bug and get them to fix it. And then we can consider workarounds for the current version once they respond.

ejona86 avatar Aug 01 '23 22:08 ejona86

Proposal https://github.com/golang/go/issues/62254 has been created to enable setting of TCP keepalive time and interval separately. Doesn't affect us directly since we were not setting TCP keepalive socket options explicitly.

We do have the option of disabling the setting of these TCP keepalive values by Go to the default of 15s by passing a negative value for Dialer.KeepAlive on the client and ListenConfig.KeepAlive on the server.

  • On the client, we create the raw connection here: https://github.com/grpc/grpc-go/blob/4c9777ceff549b54cf37acc64504a10f786c58e2/internal/transport/http2_client.go#L179 We currently pass a net.Dialer{} with no customizations. It should be easy to set the KeepAlive field of this dialer to a negative value

  • On the server though, we accept a net.Listener in Serve from the user. The only thing we can do here is to document the current behavior clearly, and ask users who don't want the Go defaults of 15s and instead want the OS defaults, to create a net.Listener using https://pkg.go.dev/net#ListenConfig.Listen with ListenConfig.KeepAlive set to a negative value.

    • We could even update one of our examples to illustrate this.

    What do you feel about this approach? @Lucaber @dfawley @ejona86

easwars avatar Aug 24 '23 13:08 easwars

We could also reset the tcp keepalive settings to the os defaults. My commit only disabled them, but reading the defaults should work too. (At least on Linux)

I could work on a poc if this is a valid option

Lucaber avatar Aug 24 '23 14:08 Lucaber

We could also reset the tcp keepalive settings to the os defaults. My commit only disabled them, but reading the defaults should work too. (At least on Linux)

I prefer the options of using the Keepalive field in net.Dialer, over directly calling tcpConn.SetKeepAlive(false) so that we don't disable TCP keepalives if it is enabled by default in the system. But the former option does leave the responsibility to the user for the server-side. But given that Go has set the default value of TCP keepalive time and interval to 15s for about 5 years, and if people have been ok with it so far, then they probably aren't too worried about it anyway.

I could work on a poc if this is a valid option

Let's wait to see what @dfawley and @ejona86 feel about this option. Thanks.

easwars avatar Aug 24 '23 18:08 easwars

We need the following changes:

  • On the client side:
    • The creation of the underlying TCP connection happens here: https://github.com/grpc/grpc-go/blob/1e0d82e9f00e430f9d6c903b814d1aea0a6566e8/internal/transport/http2_client.go#L152
      • On the last line of this function, we call DialContext on an empty net.Dialer.
      • Instead we would have to use a net.Dialer, with the Keepalive field set to a negative value.
      • This would disable the 15s override of the TCP keepalive time and interval by Go, and retain OS defaults.
      • So, something like return (&net.Dialer{Keepalive: time.Duration(-1)}).DialContext(ctx, networkType, address)
  • We also need to document the custom dialer option, that Go overrides the OS defaults for TCP keepalive time and interval and that it is the user's responsibility to retain the OS defaults, if they care about it.
  • On the server side:
    • Since, we accept a net.Listener from the user, we cannot set the Keepalive duration in the ListenConfig. It is only possible to get a net.Listener from a net.ListenConfig and not the other way around.
    • We need to document grpc.Server.Serve that Go overrides the OS defaults for TCP keepalive interval and time, and if the user cares about it, they could set net.ListenConfig.Keepalive to a negative value and use the Listen method on net.ListenConfig to create a net.Listener that they can then pass to grpc.Server.Serve.
    • We should also update the helloworld example server to demonstrate how to do this: https://github.com/grpc/grpc-go/blob/1e0d82e9f00e430f9d6c903b814d1aea0a6566e8/examples/helloworld/greeter_server/main.go#L50

@Lucaber Would you be interested in sending a PR for this? Thanks.

easwars avatar Sep 06 '23 15:09 easwars

Hi @easwars @ginayeh, happy to pick up this issue

JaydenTeoh avatar Sep 28 '23 17:09 JaydenTeoh

Sounds great - let us know if you need anything!

dfawley avatar Sep 28 '23 17:09 dfawley

@dfawley @Lucaber may I check how do I replicate the original issue?

grpc version: v1.58.2 go version: go1.20.4 darwin/arm64 OS: macOS Ventura 13.4

image

at packet 3773, I applied the rule to pftcl:

block drop in on lo0 proto tcp from any to any port = 8080

but my server does not reset the connection after a single lost keepalive packet (I am using the default keepalive settings on both client and server side).

Please kindly advise on how I can replicate the issue. Also do forgive me as I am not super experienced with computer networking but I'll pick up the pieces along the way! Thank you!

JaydenTeoh avatar Sep 30 '23 20:09 JaydenTeoh

@JaydenTeoh -- Since the root cause for the issue was already done, I guess trying to replicate the original issue is not a blocker for the fix.

arvindbr8 avatar Oct 16 '23 22:10 arvindbr8

It would be nice if @Lucaber or @JaydenTeoh are able to reproduce the original issue and then confirm that the fix in #6672 truly fixes the issue. Maybe @Lucaber can confirm?

dfawley avatar Nov 07 '23 21:11 dfawley

Also it appears the fix is incomplete. We need to make sure we're setting the keepalive socket option as appropriate (see https://github.com/grpc/grpc-go/pull/6672#discussion_r1386248342).

dfawley avatar Nov 09 '23 22:11 dfawley

@JaydenTeoh -- Did you want to patch the fix for https://github.com/grpc/grpc-go/pull/6672#discussion_r1386248342?

arvindbr8 avatar Nov 15 '23 18:11 arvindbr8

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

github-actions[bot] avatar Nov 21 '23 18:11 github-actions[bot]

@arvindbr8 Sorry on the delay. I'll be happy to work on the fix! Do give me slightly more than a week because I am currently caught up in finals if that is okay!

JaydenTeoh avatar Nov 21 '23 19:11 JaydenTeoh