TCP connection stall, avoided by disabling AutoTuneSndBuf
Description
[First of all, thanks for writing this awesome piece of software!]
Using the gvisor tcpip library on macOS (across a range of versions, including the latest I could find, i.e. gvisor.dev/gvisor v0.0.0-20240331093445-9d995324d058) I see a stall where:
- in a macOS terminal, iperf3 -c localhost -t 3600 (the 3600 is aspirational)
- connects to localhost:5201, which is accepted by a proxy that calls gonet.DialContextTCP (a rough sketch of this proxy arrangement is after the list)
- which uses gvisor to send ethernet frames over virtio to a Linux 6.6 kernel running in a VM
- where iperf3 -s is running (in a docker container)
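For reference, here is a minimal sketch of that proxy pattern, assuming a *stack.Stack that already has a NIC and routes configured. The NIC ID and VM address are illustrative (the address is taken from the pcap), and this is not the actual Docker Desktop proxy code:

import (
	"context"
	"io"
	"net"

	"gvisor.dev/gvisor/pkg/tcpip"
	"gvisor.dev/gvisor/pkg/tcpip/adapters/gonet"
	"gvisor.dev/gvisor/pkg/tcpip/network/ipv4"
	"gvisor.dev/gvisor/pkg/tcpip/stack"
)

// proxyLoop accepts host connections on localhost:5201 and dials each one
// through the gVisor stack towards the iperf3 server in the VM.
func proxyLoop(netStack *stack.Stack) error {
	ln, err := net.Listen("tcp", "127.0.0.1:5201")
	if err != nil {
		return err
	}
	for {
		host, err := ln.Accept()
		if err != nil {
			return err
		}
		go func(host net.Conn) {
			defer host.Close()
			vm, err := gonet.DialContextTCP(context.Background(), netStack, tcpip.FullAddress{
				NIC:  1,                                         // illustrative NIC ID
				Addr: tcpip.AddrFrom4([4]byte{192, 168, 65, 3}), // VM address seen in the pcap
				Port: 5201,
			}, ipv4.ProtocolNumber)
			if err != nil {
				return
			}
			defer vm.Close()
			go io.Copy(vm, host) // host -> VM: the direction that stalls
			io.Copy(host, vm)    // VM -> host
		}(host)
	}
}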
Typically after a few minutes the stream stalls (usually < 3 minutes). If I apply this patch it doesn't seem to stall:
diff --git a/vendor/gvisor.dev/gvisor/pkg/tcpip/transport/tcp/endpoint.go b/vendor/gvisor.dev/gvisor/pkg/tcpip/transport/tcp/endpoint.go
index 7b0af1759d..a21f477a14 100644
--- a/vendor/gvisor.dev/gvisor/pkg/tcpip/transport/tcp/endpoint.go
+++ b/vendor/gvisor.dev/gvisor/pkg/tcpip/transport/tcp/endpoint.go
@@ -865,6 +865,7 @@ func newEndpoint(s *stack.Stack, protocol *protocol, netProto tcpip.NetworkProto
windowClamp: DefaultReceiveBufferSize,
maxSynRetries: DefaultSynRetries,
}
+ e.sndQueueInfo.TCPSndBufState.AutoTuneSndBufDisabled.Store(1)
e.ops.InitHandler(e, e.stack, GetTCPSendBufferLimits, GetTCPReceiveBufferLimits)
e.ops.SetMulticastLoop(true)
e.ops.SetQuickAck(true)
I can also achieve the same effect by creating my endpoint manually (not using gonet) and using:
// This sets AutoTuneSndBufDisabled. Without this iperf3 stalls writing.
ep.SocketOptions().SetSendBufferSize(ep.SocketOptions().GetSendBufferSize(), true)
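For context, a minimal sketch of that manual-endpoint variant, assuming an already configured *stack.Stack; the helper name is made up and error handling is abbreviated:

import (
	"fmt"

	"gvisor.dev/gvisor/pkg/tcpip"
	"gvisor.dev/gvisor/pkg/tcpip/network/ipv4"
	"gvisor.dev/gvisor/pkg/tcpip/stack"
	"gvisor.dev/gvisor/pkg/tcpip/transport/tcp"
	"gvisor.dev/gvisor/pkg/waiter"
)

// newManualEndpoint creates a TCP endpoint directly (instead of via gonet)
// and pins the send buffer at its current size, which also sets
// AutoTuneSndBufDisabled and avoids the stall described above.
func newManualEndpoint(s *stack.Stack) (tcpip.Endpoint, *waiter.Queue, error) {
	wq := new(waiter.Queue)
	ep, tcpErr := s.NewEndpoint(tcp.ProtocolNumber, ipv4.ProtocolNumber, wq)
	if tcpErr != nil {
		return nil, nil, fmt.Errorf("NewEndpoint: %s", tcpErr)
	}
	// Re-applying the current size with notify=true disables send buffer
	// auto tuning. Without this iperf3 stalls writing.
	ep.SocketOptions().SetSendBufferSize(ep.SocketOptions().GetSendBufferSize(), true /* notify */)
	return ep, wq, nil
}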
Attached is a .pcap of a TCP session which was transferring data quickly until it stalled: gvisor-report-s100.zip
Let me know what you think! Thanks again.
Related to https://github.com/docker/for-mac/issues/7207
Steps to reproduce
The bug reproduces with Docker Desktop 4.28 on Mac (which uses gvisor internally).
In one terminal:
docker run -it --rm -p 5201:5201/tcp -p 5201:5201/udp dockerpinata/iperf3:2.1 iperf3 -s
In another terminal:
iperf3 -t 3600 -c localhost
runsc version
No response
docker version (if using docker)
No response
uname
Darwin foo.local 23.4.0 Darwin Kernel Version 23.4.0: Fri Mar 15 00:12:41 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T8103 arm64
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
No response
Thanks for the report. Let me see whether we can repro this.
Also: I don't think I knew that gVisor's netstack was used by Docker on Mac. That's awesome.
I can repro this on Mac. On Linux it runs for the full hour without issue. I'll look at whether there's anything obviously platform-dependent, but it would be really helpful if I could build docker for mac to enable debug logs and add bits that'll help me track down the issue. Is there any way for me to do so?
A couple of first observations about the pcap:
- There are a bunch of [TCP ACKed unseen segment] that I can't explain, given that they occur when the connection looks otherwise healthy. Maybe this just has to do with the packet capture method?
- The packet just before throughput drops to zero is the server (running netstack) saying its receive window is zero. It's weird that the fix -- which affects the send buffer -- resolves an issue with the receive buffer.
I misunderstood the setup. gVisor is sending packets that are originally from the client, and the server is running in a Linux VM. So it's not weird that messing with the send buffer affects behavior.
That means that the server is advertising a zero window, which is strange.
And another question: is the pcap between the client and proxy, or the proxy and iperf server?
Hi Kevin, thanks for taking a look!
It would be really helpful if I could build docker for mac to enable debug logs and add bits that'll help me track down the issue. Is there any way for me to do so?
The only way I can suggest is if there are particular things you'd like to add or enable, if you point me at them I can publish a developer build via our internal CI. (Sorry I realise this is very inconvenient!)
gVisor is sending packets that are originally from the client, and the server is running in a Linux VM. So it's not weird that messing with the send buffer affects behavior.
Yes that's it -- the iperf3 -c is on the Mac and is sending as much data as it can via gvisor, as quickly as it can. I saw the (Linux kernel) server advertising a zero window too, which is definitely confusing.
I generated the pcap on the Linux side with
docker run -it -v ~/bugs/receive-window-closed/:/foo --net=host djs55/tcpdump -n -i eth0 -w /foo/iperf.pcap
On Linux by default Docker Desktop is using qemu's builtin networking. I'd like to switch this to gvisor by default but there's a different bug in play (which I suspect is a lower-level qemu virtio-net problem). If you'd like to play with it, you can edit .docker/desktop/settings.json and set networkType to gvisor.
Back on Mac, the default ethernet layer is a SOCK_DGRAM socketpair shared with the Apple virtualization.framework, which implements virtio-net. Perhaps there's some issue there which is triggering bad behaviour on the higher layer. Under stress it does drop datagrams. I don't know if it re-orders things too. I notice that if I switch the hypervisor to qemu (from the whale menu -> Settings -> uncheck "Use Virtualisation framework"), which on Mac does still use gvisor (sorry, too many configuration combinations!), then iperf3 seems slower (10x) but healthier.
Thanks again for all your help so far
Another data point: I shared a build with the AutoTuneSndBuf tweak on https://github.com/docker/for-mac/issues/7207#issuecomment-2034153742 and the user reports it seems to help, but they still see the zero window stall eventually. So it sounds like a lower level issue, one which the AutoTuneSndBuf is good at triggering, but doesn't cause directly.
A nice thing about this kind of bug is that it gives me an opportunity to increase debuggability a bit. I've created a dev build of Docker Desktop (Mac arm64 and Mac amd64) which, if this file exists:
touch /tmp/docker-desktop-gvisor.pcap
and the app is then started, will record the last 1024 packets sent to the VM in a ring. When the app quits it writes the ring to disk. So this gives the trace more from the gvisor point of view (rather than using docker run tcpdump to see it from the VM's point of view). Here's an example run with a stall:
docker-desktop-gvisor.zip
It looks similar to the previous trace
Edit: the pcap has a bug where the frame length == capture length == 100
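For reference, a rough sketch of what such a "last N frames" ring could look like, using gopacket's pcapgo writer. This is just the idea, not the actual Docker Desktop code; the comment marks the Length field that was wrong in the first trace:

package capture

import (
	"os"
	"sync"
	"time"

	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
	"github.com/google/gopacket/pcapgo"
)

// Ring keeps the last N ethernet frames sent to the VM so they can be
// dumped as a pcap on shutdown.
type Ring struct {
	mu     sync.Mutex
	frames [][]byte
	times  []time.Time
	next   int
	full   bool
}

func NewRing(n int) *Ring {
	return &Ring{frames: make([][]byte, n), times: make([]time.Time, n)}
}

// Record stores a copy of the frame, overwriting the oldest entry.
func (r *Ring) Record(frame []byte) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.frames[r.next] = append([]byte(nil), frame...)
	r.times[r.next] = time.Now()
	r.next = (r.next + 1) % len(r.frames)
	if r.next == 0 {
		r.full = true
	}
}

// WritePcap flushes the ring, oldest first, to a pcap file.
func (r *Ring) WritePcap(path string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	w := pcapgo.NewWriter(f)
	if err := w.WriteFileHeader(65536, layers.LinkTypeEthernet); err != nil {
		return err
	}
	start, count := 0, r.next
	if r.full {
		start, count = r.next, len(r.frames)
	}
	for i := 0; i < count; i++ {
		idx := (start + i) % len(r.frames)
		ci := gopacket.CaptureInfo{
			Timestamp:     r.times[idx],
			CaptureLength: len(r.frames[idx]),
			Length:        len(r.frames[idx]), // in the first trace this was stuck at 100
		}
		if err := w.WritePacket(ci, r.frames[idx]); err != nil {
			return err
		}
	}
	return nil
}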
[ Sorry for the spam, here's another trace: docker-desktop-gvisor.zip taken on the host with the frame length field fixed, so Wireshark can see the payload length. I'll refresh the DD builds in case that's useful. ]
Can you try a build with this patch? My current theory is that the proxy is blocked waiting to write out some data, but the send buffer shrinks and we forget to notify the proxy that it can write again. That would show up in the pcap as a receive buffer zero window because packets sent by iperf would queue up at the proxy/netstack waiting for the send.
I have a theory+test for how AutoTuneSndBuf could cause the issue. I'm less sure about how the user is getting the stall without auto tuning.
I refreshed to gvisor.dev/gvisor v0.0.0-20240403010941-979aae3d2c21 and applied the patch (minus the test parts which weren't vendored):
% git show
commit 11931d7816e1c29428300ad9206ada8a26768e33 (HEAD -> gvisor-debug)
Author: David Scott <[email protected]>
Date: Wed Apr 3 19:14:04 2024 +0100
https://github.com/google/gvisor/pull/10245/commits/ca92de4400d4351ba3b505b3ced2e42d9fd6323f
diff --git a/vendor/gvisor.dev/gvisor/pkg/tcpip/transport/tcp/endpoint.go b/vendor/gvisor.dev/gvisor/pkg/tcpip/transport/tcp/endpoint.go
index cf0b0093b9..6f3cc457a8 100644
--- a/vendor/gvisor.dev/gvisor/pkg/tcpip/transport/tcp/endpoint.go
+++ b/vendor/gvisor.dev/gvisor/pkg/tcpip/transport/tcp/endpoint.go
@@ -2974,11 +2974,11 @@ func (e *Endpoint) updateSndBufferUsage(v int) {
notify = notify && e.sndQueueInfo.SndBufUsed < int(newSndBufSz)>>1
e.sndQueueInfo.sndQueueMu.Unlock()
- if notify {
- // Set the new send buffer size calculated from auto tuning.
- e.ops.SetSendBufferSize(newSndBufSz, false /* notify */)
- e.waiterQueue.Notify(waiter.WritableEvents)
- }
+ // if notify {
+ // Set the new send buffer size calculated from auto tuning.
+ e.ops.SetSendBufferSize(newSndBufSz, false /* notify */)
+ e.waiterQueue.Notify(waiter.WritableEvents)
+ // }
}
// readyToRead is called by the protocol goroutine when a new segment is ready
It still stalled with a zero window :( Here's a pcap of the last ~1000 packets from the host's point of view: docker-desktop-gvisor.zip
The part I don't understand from the pcap (looking at the latest one): Linux (receiving the iperf stream) sends an ACK
924 0.017288 192.168.65.3 192.168.65.1 TCP 66 5201 → 16816 [ACK] Seq=1980864929 Ack=814060412 Win=10142 Len=0 TSval=291954484 TSecr=312347615
assuming the window scaling is 7, it should be fine to send up to 814060412 + 10142 * 128 == 815338364. gvisor happily sends
961 0.017890 192.168.65.1 192.168.65.3 TCP 32834 16816 → 5201 [PSH, ACK] Seq=815305596 Ack=1980864929 Win=4096 Len=32768 TSval=312347616 TSecr=291954484 [Packet size limited during capture]
which ends at 815305596 + 32768 == 815338364 which means the window has genuinely hit zero. Then Linux doesn't open the window
962 0.019635 192.168.65.3 192.168.65.1 TCP 66 [TCP ZeroWindow] 5201 → 16816 [ACK] Seq=1980864929 Ack=815272828 Win=0 Len=0 TSval=291954486 TSecr=312347615
so gvisor sends a series of ACKs (window probes?)
965 0.220037 192.168.65.1 192.168.65.3 TCP 66 16816 → 5201 [ACK] Seq=815272827 Ack=1980864929 Win=4096 Len=0 TSval=312347818 TSecr=291954486
but Linux never opens the window. I don't see any more traffic about this connection from Linux in fact. Maybe it's really some kind of connection tracking / firewall issue on my end preventing Linux from opening the window? I'll try to rule that out just in case...
Agreed that those are window probes sent with exponential backoff. And thanks for looking into the conntrack/firewall stuff.
- TCP checksums in the gVisor -> Linux direction are all zero. Maybe virtio expects real checksums and is dropping packets before they hit tcpdump? Try turning off RXChecksumOffload. Doesn't make a lot of sense given that the stream works for a while first.
- We sometimes run into issues with os.File, e.g. where, if we don't dup() the file descriptor ourselves, Go will close the file descriptor when the os.File is garbage collected. Are you using os.File anywhere to connect things? Again it doesn't seem that likely -- if this were the issue I'd expect we wouldn't see window probes. (A sketch of the dup() pattern is below.)
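For reference, a minimal sketch of the dup() pattern mentioned above (hypothetical helper name, using golang.org/x/sys/unix):

import (
	"os"

	"golang.org/x/sys/unix"
)

// takeFD detaches the descriptor from an os.File. The dup'd fd stays valid
// even after the os.File is garbage collected (its finalizer would otherwise
// close the original fd underneath whatever is still using it).
func takeFD(f *os.File) (int, error) {
	fd, err := unix.Dup(int(f.Fd()))
	if err != nil {
		return -1, err
	}
	f.Close() // the dup'd fd is now independently owned
	return fd, nil
}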
@kevinGC why are TCP checksums zero when sending from gvisor? If we don't have GSO offload then they should be set by gvisor? Is it possible that somehow window probes are being sent with a zero checksum incorrectly? If so Linux would drop them and never see the zero window probe resulting in this behavior.
Given that the proxy is a Go program embedding netstack as a library rather than running gvisor itself, maybe the stack is not being set up the same way as in gvisor, so it could behave differently.
https://github.com/google/gvisor/blob/5b30c4a27543f14351b154d775f5ce315010abc7/pkg/tcpip/transport/tcp/snd.go#L908 is the code that sends ZWP.
https://github.com/google/gvisor/blob/5b30c4a27543f14351b154d775f5ce315010abc7/pkg/tcpip/transport/tcp/connect.go#L839 is where we decide if we need to set a checksum.
@kevinGC thanks for your thoughts. I'll dig in a bit more on my side (once I'm back from PTO next week). Thanks again for all your help so far.
FYI the checksums are off because I've set stack.CapabilityTXChecksumOffload in gvisor (which is sending to Linux) and modified Linux to not expect the checksums at the virtio-net layer (a bit of a hack because there's no way to negotiate this capability currently). I can certainly try turning the checksums back on to see what happens.
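For context, advertising that capability from a custom link endpoint can be done with a small wrapper along these lines (a sketch, not necessarily how Docker Desktop's endpoint actually does it):

import "gvisor.dev/gvisor/pkg/tcpip/stack"

// txOffloadEndpoint wraps an existing link endpoint and additionally
// advertises TX checksum offload, so netstack leaves TCP/UDP checksums
// unset and relies on the layer below not requiring them.
type txOffloadEndpoint struct {
	stack.LinkEndpoint
}

func (e *txOffloadEndpoint) Capabilities() stack.LinkEndpointCapabilities {
	return e.LinkEndpoint.Capabilities() | stack.CapabilityTXChecksumOffload
}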
If Linux is configured to ignore checksums it should not matter. But worth giving it a shot anyway. That said, some notes on how send/receive tuning works.
Send buffer tuning relies on how much data is acked by the receiver to grow send buffers. Receive buffer tuning depends on how quickly the data is read by the application, in this case your proxy code. If your receive buffers are not growing, it's probably because the proxy is not reading fast enough.
If send buffers are not growing, it's probably because you are not getting timely acks which move the sliding window forward fast enough.
If Linux is not opening its receive window it would mean that iperf is not reading the data, which seems highly unlikely. The other reason it could go to zero is if there are too many out-of-order packets being received and gvisor isn't sending the packet that closes the gap. In such a case the out-of-order packets will fill up the receive buffer on the Linux side, resulting in a stall until the packet that moves the window forward is sent by gVisor.
Something I missed: I noticed you are using a SOCK_DGRAM unix socket; I would change it to SOCK_SEQPACKET, which guarantees in-order delivery.
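For illustration, a minimal sketch of creating the suggested SOCK_SEQPACKET socketpair with golang.org/x/sys/unix (Linux only; as noted further down, macOS rejects this socket type):

package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Socketpair returns two connected descriptors; SOCK_SEQPACKET keeps
	// datagram boundaries and also guarantees reliable, in-order delivery.
	fds, err := unix.Socketpair(unix.AF_UNIX, unix.SOCK_SEQPACKET, 0)
	if err != nil {
		fmt.Fprintf(os.Stderr, "socketpair: %v\n", err)
		os.Exit(1)
	}
	defer unix.Close(fds[0])
	defer unix.Close(fds[1])
	fmt.Println("SOCK_SEQPACKET socketpair created:", fds)
}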
Just saw the pcap; the whole "ACKing a packet not seen" thing seems to indicate out-of-order delivery of packets. Your issue might be the use of SOCK_DGRAM vs SOCK_SEQPACKET.
@hbhasker thanks, I’ll investigate and report back :)
Also since there’s no evidence of gvisor malfunctioning (the link to the auto tune buffer option seems more like a coincidence now) feel free to close this ticket in the meantime!
Thanks for all the help Bhasker. A few thoughts:
ACKing a packet not seen seems to indicate Out of Order delivery of packets
My understanding is that ACKing unseen packets usually comes from capture not keeping up with transmission. I'm not sure I follow how unseen packet ACKs indicate out-of-order delivery.
gvisor isnt sending the packet that closes the gap
gVisor is sending window probes and getting nothing back; if there's a gap we should be getting back an ACK for a sent-and-dropped packet. Even with a zero window, we should be getting some kind of response.
And I'll just leave this open until I hear whether SOCK_SEQPACKET does the trick.
Oh, here's a thought: our zero window probes are broken? IIUC they should be sending the next byte in the probe. From RFC 9293 3.8.6.1:
The sending TCP peer must regularly transmit at least one octet of new data (if available), or retransmit to the receiving TCP peer even if the send window is zero, in order to "probe" the window. This retransmission is essential to guarantee that when either TCP peer has a zero window the reopening of the window will be reliably reported to the other.
Per the pcap, we're sending zero-length probes, which IIUC Linux doesn't have to ACK (why would it, it's just a duplicate and up-to-date ACK).
This seems to be supported by Wireshark calling the probes TCP Dup ACK instead of TCP ZeroWindowProbe, which it seems to do elsewhere (picture from an unrelated wireshark Q&A post).
So the theory is we're filling up the Linux receive buffer, getting an ACK+zerowindow packet, and never realizing that the window has reopened because we don't properly probe and thus never know that iperf has consumed its bytes.
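To make the distinction concrete, here's a toy model of the two probe shapes being discussed (not netstack code; the sequence number is taken from the zero-window ACK in the pcap above):

package main

import "fmt"

// probe is a toy model of the TCP fields relevant to zero-window probing.
type probe struct {
	seq        uint32 // sequence number of the segment
	payloadLen int    // bytes of data carried
}

func main() {
	const sndUna = 815272828 // next unacked byte, per the zero-window ACK in the pcap

	// What the pcap shows: seq = snd.una-1, no payload. The receiver can treat
	// this as a duplicate, up-to-date ACK and never reply.
	observed := probe{seq: sndUna - 1, payloadLen: 0}

	// One conforming shape per RFC 9293 3.8.6.1: retransmit at least one octet
	// of pending data, which the receiver must ACK, reporting its current
	// (possibly reopened) window.
	rfcStyle := probe{seq: sndUna, payloadLen: 1}

	fmt.Printf("observed: %+v\nRFC-style: %+v\n", observed, rfcStyle)
}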
That's a good point. I wonder how we missed that. Maybe worth making a quick fix and seeing?
#10252 has the change, although I'm still wrestling with tests.
Wait, but that's sending snd.una-1. So sending the last acked byte should be fine too, since it will elicit a dup ack.
@djs55 could you try #10252?
@hbhasker looks like Linux doesn't follow the RFC either: it basically sends a keepalive, which has the same effect. I.e. it sends a byte with an already-ACKed sequence number, not the next byte. That seems easier, so I'll do that when I get the chance. But I think #10252 should also fix the problem for now.
Good news -- with #10252 I can't reproduce the stall any more! iperf3 has run for 3 hours so far today with no problem. To sanity-check I removed the patch and it stalled in under 60s. For fun I plotted a graph from the first 2 hours of the iperf output.
Regarding SEQPACKET: I think this will improve the situation on Linux (thanks for the tip!) Unfortunately I think it's not supported by macOS yet. macOS has the headers but rejects my attempt to use the socket type:
% cat main.c
#include <sys/socket.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
void check(char *description, int family, int type){
int s = socket(family, type, 0);
if (s == -1) {
printf("%s errno = %d (%s)\n", description, errno, strerror(errno));
} else {
close(s);
printf("%s OK\n", description);
}
}
int main(){
check("AF_INET SOCK_SEQPACKET", AF_INET, SOCK_SEQPACKET);
check("AF_LOCAL SOCK_SEQPACKET", AF_LOCAL, SOCK_SEQPACKET);
check("AF_INET SOCK_DGRAM ", AF_INET, SOCK_DGRAM);
check("AF_LOCAL SOCK_DGRAM ", AF_LOCAL, SOCK_DGRAM);
}
% cc -o seqpacket main.c
% ./seqpacket
AF_INET SOCK_SEQPACKET errno = 43 (Protocol not supported)
AF_LOCAL SOCK_SEQPACKET errno = 43 (Protocol not supported)
AF_INET SOCK_DGRAM OK
AF_LOCAL SOCK_DGRAM OK
(On Linux it looks ok.) I also found a comment in the chromium source:
#if BUILDFLAG(IS_APPLE)
// macOS does not support SEQPACKET.
which is a shame.
However the bug looks fixed to me which is great news. Thanks, both, for all your help!
Glad that worked! We're going to go with #10258 (no need to split segments, more Linux-like, fewer test changes). Feel free to try it or wait for it to get merged. It should also fix the issue.
A few things I think we should verify.
a) I read the linux code and it doesn't seem to send any data at all. Its behaviour is only different when the urgent pointer is set, where it sends two probes instead of one. gVisor doesn't implement the urgent pointer so we can just ignore that case.
b) As per the RFC https://datatracker.ietf.org/doc/html/rfc9293#section-3.10.7.4, sending a packet with the ACK bit set and an out-of-window sequence number will generate an ACK even if there is no payload. I would write a small Linux packetimpact test to verify this. Also see https://sourcegraph.com/github.com/torvalds/linux@fec50db7033ea478773b159e0e2efb135270e3b7/-/blob/net/ipv4/tcp_input.c?L5893, which will always send an out-of-window dup ack as long as it's not rate limited.
c) The implementation in https://github.com/google/gvisor/commit/0b36b4962731d37b1c2b1b0c3a1f1b811bf7261f is not what linux does and introduces a payload into the stream that is not even part of the original data. I would not do this, as it would be very strange in TCP packet captures where ZWPs suddenly carry a different byte for the same sequence number in probes vs a regular retransmit of data.
I think we are missing something else here. A packetimpact test should help figure it out better.