gvisor icon indicating copy to clipboard operation
gvisor copied to clipboard

netstack: possible compat issues or malfunctions of non-standard TCP Timestamp Options

Open LionNatsu opened this issue 9 months ago • 11 comments

Description

<ACK> dropped silently. A compatibility issue?

What we expects:

sequenceDiagram
	participant c as curl
	participant s as Server(gVisor)
	Note over s: [LISTEN]
	c ->> s: SYN (TSopt)
	s ->> c: SYN+ACK (TSopt)
	Note over s: [SYN-RCVD]
	c ->> s: ACK (TSopt)
	Note over s: [ESTAB]

What happened:

sequenceDiagram
	participant c as curl
	participant s as Server(gVisor)
	Note over s: [LISTEN]
	c ->> s: SYN (TSopt)
	s ->> c: SYN+ACK (TSopt)
	Note over s: [SYN-RCVD]
	c -x s: ACK (no TSopt)
	Note over s: retransmission back-off 1s...
	s ->> c: SYN+ACK (TSopt, retransmits)
	c -x s: ACK (no TSopt, dup)
	Note over s: retransmission back-off 2s...
	s ->> c: SYN+ACK (TSopt, retransmits)
	c -x s: ACK (no TSopt, dup)
	Note over s: retransmission back-off 4s...
  1. Client (e.g. curl) established a TCP connection one-side (connect syscall completed). However, Server enabled gVisor is still in SYN-RCVD state, waiting for the <ACK> (But we have confirmed the packet has arrived into the gVisor network namespace).
  2. We found the problem only happens when the client side is using old versions of Linux kernel (e.g. 3.x ~ 4.14.x). New versions (e.g. 5.4.x ~ latest) of Linux kernel establish two-way connections with gVisor Server with no problem. It turns out old kernels does not fully comply to the RFC 7323 "TCP Extensions for High Performance" Section 3.2 - Timestamp Options says Once TSopt has been successfully negotiated, that is both <SYN> and <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST> segment for the duration of the connection.
  3. On the other hand, gVisor dropped the segments silently. We think it probably because RFC 7323 also states that If a non-<RST> segment is received without a TSopt, a TCP SHOULD silently drop the segment. Native Linux TCP/IP doesn't do this, no matter old or new version.

Workarounds:

  • Upgrading clients to new OS, new kernel (sounds crazy but we did do this, because all those our "clients" are micro-service containers running over few host server, which are easy to migrate).
  • Or, disabling the TCP Timestamp feature on client-side completely sysctl net.ipv4.tcp_timestamp=0.
  • Or, using host network mode.

Q1: Is this an intentional design decision?

But only partially. Malfunctions?

  1. Yet, suppose that is working-as-designed. The last weird thing is, even gVisor silently drops those non-standard ACK-of-SYN segments, the connections seem to be stuck in the backlog queue of accpet syscall and forever. It leads to a very strange phenomenon. Let's say the server called listen(s, 5). Client will keep failing until, suddenly, the 6th attempt of connect succeeded. It becomes completely normal, once and for all, as for this listener.
  2. We soon realise it is because SYN-cookies kick in. The backlog becomes full, and it switched to SYN-cookies. Starting from now, TCP Options in every handshake will be ignored, and so is the TSopt things.
  3. Even weirder, after leaving the server alone without any more request a whole night, we thought the half-established connections should have been all timed out and discarded, and new coming connection will be blocked again, but it didn't. The first 5 (certainly lost) connections stuck forever, forcing all the other connections use SYN-cookie and possibly degrade the performance.

Q2: Is there something missing when half-established connection (SYN+ACK retransmission) failed? For example, removing the failed endpoint from the queue?

Environment is the same as #11535

LionNatsu avatar Mar 10 '25 14:03 LionNatsu

cc: @JiaHuann @milantracy

LionNatsu avatar Mar 10 '25 14:03 LionNatsu

ACK-of-SYN is sliently dropped here: https://github.com/google/gvisor/blob/4ba931dd226e5a04d0d8b89586239ebba72b9e0e/pkg/tcpip/transport/tcp/connect.go#L476-L484 (The comment is not entirely correct, as the RFC states that it is not “MUST”, but “SHOULD”)

LionNatsu avatar Mar 11 '25 02:03 LionNatsu

Fly by , yes this was an intentional decision based on the RFC. I probably missed checking the Linux implementation on whether it implements the specific part of the RFC.

@kevinGC FYI.

hbhasker avatar Mar 11 '25 12:03 hbhasker

Hey Bhasker, @nybidari leads netstack now.

Doing what Linux does -- which is apparently just not caring whether a timestamp is returned -- seems like a reasonable way to support broken clients like this. WDYT?

kevinGC avatar Mar 11 '25 17:03 kevinGC

Agreed. Congrats @nybidari ! Yes matching linux behaviour seems correct.

hbhasker avatar Mar 11 '25 18:03 hbhasker

Thanks Kevin and Bhasker.

For Q1: Yes, we silently drop the ACKs if the timestamp option has been negotiated and the ACKs do not have timestamp option.

For Q2: The pendingEndpoints are deleted:

  • when there is an error
  • when the endpoints are shutdown/closed. In this case, we silently drop the ACKs and do not return error, which is why the pendingEndpoints are not cleaned up.

I can work on this in the next one or two weeks. We also welcome any PRs to fix this issue :)

nybidari avatar Mar 11 '25 20:03 nybidari

Thanks Kevin and Bhasker.

For Q1: Yes, we silently drop the ACKs if the timestamp option has been negotiated and the ACKs do not have timestamp option.

For Q2: The pendingEndpoints are deleted:

  • when there is an error
  • when the endpoints are shutdown/closed. In this case, we silently drop the ACKs and do not return error, which is why the pendingEndpoints are not cleaned up.

I can work on this in the next one or two weeks. We also welcome any PRs to fix this issue :)

Yes, I am working on this. But I haven't find any tutorial for building from source with debug info. Can you give some advice for that? I am not good at bazel.

I followed this page. https://gvisor.dev/docs/user_guide/debugging/

make dev BAZEL_OPTIONS="-c dbg --define gotags=debug"

And the runsc report unknown platform for systrap.

Thanks. ;)

JiaHuann avatar Mar 12 '25 04:03 JiaHuann

Forget about it, I just built gVisor with debug info successfully. Now testing the fix to the issue.

JiaHuann avatar Mar 12 '25 10:03 JiaHuann

Thanks, @nybidari, for your clarification!

For Q1: Yes, we silently drop the ACKs if the timestamp option has been negotiated and the ACKs do not have timestamp option.

That makes sense. However, do we have any plan to match or shift to Linux's behaviour in this case, or would you rather stick to the current one?

LionNatsu avatar Mar 18 '25 02:03 LionNatsu

Okay, I saw @JiaHuann submit a pendingEndpoints fix in #11557; thank you, JiaHuann 8)

LionNatsu avatar Mar 18 '25 02:03 LionNatsu

@nybidari, any thoughts on the PR #11557?

LionNatsu avatar Jun 11 '25 12:06 LionNatsu