netstack: possible compat issues or malfunctions of non-standard TCP Timestamp Options
Description
<ACK> dropped silently. A compatibility issue?
What we expects:
sequenceDiagram
participant c as curl
participant s as Server(gVisor)
Note over s: [LISTEN]
c ->> s: SYN (TSopt)
s ->> c: SYN+ACK (TSopt)
Note over s: [SYN-RCVD]
c ->> s: ACK (TSopt)
Note over s: [ESTAB]
What happened:
sequenceDiagram
participant c as curl
participant s as Server(gVisor)
Note over s: [LISTEN]
c ->> s: SYN (TSopt)
s ->> c: SYN+ACK (TSopt)
Note over s: [SYN-RCVD]
c -x s: ACK (no TSopt)
Note over s: retransmission back-off 1s...
s ->> c: SYN+ACK (TSopt, retransmits)
c -x s: ACK (no TSopt, dup)
Note over s: retransmission back-off 2s...
s ->> c: SYN+ACK (TSopt, retransmits)
c -x s: ACK (no TSopt, dup)
Note over s: retransmission back-off 4s...
- Client (e.g. curl) established a TCP connection one-side (
connectsyscall completed). However, Server enabled gVisor is still in SYN-RCVD state, waiting for the<ACK>(But we have confirmed the packet has arrived into the gVisor network namespace). - We found the problem only happens when the client side is using old versions of Linux kernel (e.g. 3.x ~ 4.14.x). New versions (e.g. 5.4.x ~ latest) of Linux kernel establish two-way connections with gVisor Server with no problem. It turns out old kernels does not fully comply to the RFC 7323 "TCP Extensions for High Performance" Section 3.2 - Timestamp Options says Once TSopt has been successfully negotiated, that is both
<SYN>and<SYN,ACK>contain TSopt, the TSopt MUST be sent in every non-<RST>segment for the duration of the connection. - On the other hand, gVisor dropped the segments silently. We think it probably because RFC 7323 also states that If a non-
<RST>segment is received without a TSopt, a TCP SHOULD silently drop the segment. Native Linux TCP/IP doesn't do this, no matter old or new version.
Workarounds:
- Upgrading clients to new OS, new kernel (sounds crazy but we did do this, because all those our "clients" are micro-service containers running over few host server, which are easy to migrate).
- Or, disabling the TCP Timestamp feature on client-side completely
sysctl net.ipv4.tcp_timestamp=0. - Or, using host network mode.
Q1: Is this an intentional design decision?
But only partially. Malfunctions?
- Yet, suppose that is working-as-designed. The last weird thing is, even gVisor silently drops those non-standard ACK-of-SYN segments, the connections seem to be stuck in the backlog queue of
accpetsyscall and forever. It leads to a very strange phenomenon. Let's say the server calledlisten(s, 5). Client will keep failing until, suddenly, the 6th attempt of connect succeeded. It becomes completely normal, once and for all, as for this listener. - We soon realise it is because SYN-cookies kick in. The backlog becomes full, and it switched to SYN-cookies. Starting from now, TCP Options in every handshake will be ignored, and so is the TSopt things.
- Even weirder, after leaving the server alone without any more request a whole night, we thought the half-established connections should have been all timed out and discarded, and new coming connection will be blocked again, but it didn't. The first 5 (certainly lost) connections stuck forever, forcing all the other connections use SYN-cookie and possibly degrade the performance.
Q2: Is there something missing when half-established connection (SYN+ACK retransmission) failed? For example, removing the failed endpoint from the queue?
Environment is the same as #11535
cc: @JiaHuann @milantracy
ACK-of-SYN is sliently dropped here: https://github.com/google/gvisor/blob/4ba931dd226e5a04d0d8b89586239ebba72b9e0e/pkg/tcpip/transport/tcp/connect.go#L476-L484 (The comment is not entirely correct, as the RFC states that it is not “MUST”, but “SHOULD”)
Fly by , yes this was an intentional decision based on the RFC. I probably missed checking the Linux implementation on whether it implements the specific part of the RFC.
@kevinGC FYI.
Hey Bhasker, @nybidari leads netstack now.
Doing what Linux does -- which is apparently just not caring whether a timestamp is returned -- seems like a reasonable way to support broken clients like this. WDYT?
Agreed. Congrats @nybidari ! Yes matching linux behaviour seems correct.
Thanks Kevin and Bhasker.
For Q1: Yes, we silently drop the ACKs if the timestamp option has been negotiated and the ACKs do not have timestamp option.
For Q2: The pendingEndpoints are deleted:
- when there is an error
- when the endpoints are shutdown/closed. In this case, we silently drop the ACKs and do not return error, which is why the pendingEndpoints are not cleaned up.
I can work on this in the next one or two weeks. We also welcome any PRs to fix this issue :)
Thanks Kevin and Bhasker.
For Q1: Yes, we silently drop the ACKs if the timestamp option has been negotiated and the ACKs do not have timestamp option.
For Q2: The pendingEndpoints are deleted:
- when there is an error
- when the endpoints are shutdown/closed. In this case, we silently drop the ACKs and do not return error, which is why the pendingEndpoints are not cleaned up.
I can work on this in the next one or two weeks. We also welcome any PRs to fix this issue :)
Yes, I am working on this. But I haven't find any tutorial for building from source with debug info. Can you give some advice for that? I am not good at bazel.
I followed this page. https://gvisor.dev/docs/user_guide/debugging/
make dev BAZEL_OPTIONS="-c dbg --define gotags=debug"
And the runsc report unknown platform for systrap.
Thanks. ;)
Forget about it, I just built gVisor with debug info successfully. Now testing the fix to the issue.
Thanks, @nybidari, for your clarification!
For Q1: Yes, we silently drop the ACKs if the timestamp option has been negotiated and the ACKs do not have timestamp option.
That makes sense. However, do we have any plan to match or shift to Linux's behaviour in this case, or would you rather stick to the current one?
Okay, I saw @JiaHuann submit a pendingEndpoints fix in #11557; thank you, JiaHuann 8)
@nybidari, any thoughts on the PR #11557?