for-mac
Docker Desktop breaks LocalStack S3 transfers on M1 Mac since 4.27.0 (unknown network issue/limiting)
Description
We noticed that as soon as we upgraded Docker Desktop to 4.27.0+, our Cucumber tests began to hang indefinitely while the AWS SDK for Ruby uploads test files to a LocalStack bucket. I was able to replicate the behaviour (described below) using the awslocal CLI. The hang occurs on a different test each time, and on a different iteration each time in the replication steps.
Started: when we upgraded Docker to 4.27.0 (upgrading to a later or the latest version does not help).
Workaround: Downgrade docker desktop to 4.26.1
Environment
- OS: macOS Ventura 13.6.4
- LocalStack: 3.1.0
- Docker Desktop 4.27, 4.28
Affects colleagues with M1 MacBooks running macOS Ventura. Does NOT affect colleagues with M1 MacBooks running macOS Sonoma. Does NOT affect colleagues with Intel MacBooks.
(We are restricted from upgrading to Sonoma at this time)
Raised a ticket with LocalStack, but they are unable to replicate it on an M3 Max Mac: https://github.com/localstack/localstack/issues/10340
Reproduce
docker run \
--rm -it \
-p 4566:4566 \
-p 4510-4559:4510-4559 \
-e DEBUG=1 localstack/localstack:3.1.0
In a different shell session:
# Create a 22 MB file
dd if=/dev/zero of=samplefile.dat bs=1m count=22
# Create a bucket in LocalStack
awslocal s3api create-bucket --bucket testbucket
# Attempt to copy the file 50 times
for i in {1..50}; do time awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat --body ./samplefile.dat &> /dev/null; done
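To make the failing iteration easier to spot, here is a hedged variant of the loop above: a small wrapper (the function name `run_with_stall_check` and the 10-second threshold are my own choices, not from the original report) that flags any iteration exceeding a threshold. It assumes awslocal is installed and LocalStack is listening on :4566.

```shell
# Hypothetical wrapper: time each upload and flag iterations that
# exceed a threshold, so the stalling iteration stands out in the output.
run_with_stall_check() {
  local threshold_s=$1; shift
  local start elapsed
  start=$(date +%s)
  "$@" > /dev/null 2>&1
  elapsed=$(( $(date +%s) - start ))
  if [ "$elapsed" -ge "$threshold_s" ]; then
    echo "STALL (${elapsed}s): $*"
  else
    echo "ok (${elapsed}s)"
  fi
}

# Example: flag any put-object that takes 10s or longer
# for i in {1..50}; do
#   run_with_stall_check 10 awslocal s3api put-object \
#     --bucket testbucket --key somekey/samplefile.dat --body ./samplefile.dat
# done
```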
Expected behavior
The file copies successfully or fails with an error
Actual: something in Docker is interrupting or killing the connection, which makes LocalStack retry and hang.
❯ for i in {1..50}; do
time awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat --body ./samplefile.dat &> /dev/null;
done
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.36s user 0.18s system 63% cpu 0.852 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.37s user 0.17s system 64% cpu 0.837 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.37s user 0.18s system 66% cpu 0.817 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.37s user 0.17s system 64% cpu 0.835 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.37s user 0.18s system 64% cpu 0.844 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.36s user 0.17s system 64% cpu 0.823 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.36s user 0.17s system 63% cpu 0.830 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.35s user 0.17s system 64% cpu 0.818 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.36s user 0.17s system 62% cpu 0.845 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.36s user 0.17s system 64% cpu 0.820 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.36s user 0.18s system 52% cpu 1.038 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.37s user 0.19s system 52% cpu 1.062 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.36s user 0.18s system 63% cpu 0.849 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.37s user 0.18s system 52% cpu 1.049 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.38s user 0.20s system 65% cpu 0.881 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.35s user 0.17s system 62% cpu 0.839 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.36s user 0.17s system 65% cpu 0.817 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.37s user 0.19s system 65% cpu 0.853 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.46s user 0.26s system 0% cpu 3:13.53 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.45s user 0.27s system 0% cpu 3:07.84 total
docker version
❯ docker version
Client:
Cloud integration: v1.0.35+desktop.10
Version: 25.0.1
API version: 1.44
Go version: go1.21.6
Git commit: 29cf629
Built: Tue Jan 23 23:06:12 2024
OS/Arch: darwin/arm64
Context: desktop-linux
Server: Docker Desktop 4.27.0 (135262)
Engine:
Version: 25.0.1
API version: 1.44 (minimum version 1.24)
Go version: go1.21.6
Git commit: 71fa3ab
Built: Tue Jan 23 23:09:35 2024
OS/Arch: linux/arm64
Experimental: false
containerd:
Version: 1.6.27
GitCommit: a1496014c916f9e62104b33d1bb5bd03b0858e59
runc:
Version: 1.1.11
GitCommit: v1.1.11-0-g4bccb38
docker-init:
Version: 0.19.0
GitCommit: de40ad0
docker info
❯ docker info
Client:
Version: 25.0.1
Context: desktop-linux
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.12.1-desktop.4
Path: /Users/mgrundie/.docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.24.3-desktop.1
Path: /Users/mgrundie/.docker/cli-plugins/docker-compose
debug: Get a shell into any image or container. (Docker Inc.)
Version: 0.0.22
Path: /Users/mgrundie/.docker/cli-plugins/docker-debug
dev: Docker Dev Environments (Docker Inc.)
Version: v0.1.0
Path: /Users/mgrundie/.docker/cli-plugins/docker-dev
extension: Manages Docker extensions (Docker Inc.)
Version: v0.2.21
Path: /Users/mgrundie/.docker/cli-plugins/docker-extension
feedback: Provide feedback, right in your terminal! (Docker Inc.)
Version: v1.0.4
Path: /Users/mgrundie/.docker/cli-plugins/docker-feedback
init: Creates Docker-related starter files for your project (Docker Inc.)
Version: v1.0.0
Path: /Users/mgrundie/.docker/cli-plugins/docker-init
sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc.)
Version: 0.6.0
Path: /Users/mgrundie/.docker/cli-plugins/docker-sbom
scout: Docker Scout (Docker Inc.)
Version: v1.3.0
Path: /Users/mgrundie/.docker/cli-plugins/docker-scout
WARNING: Plugin "/Users/mgrundie/.docker/cli-plugins/docker-scan" is not valid: failed to fetch metadata: fork/exec /Users/mgrundie/.docker/cli-plugins/docker-scan: no such file or directory
Server:
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 19
Server Version: 25.0.1
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: a1496014c916f9e62104b33d1bb5bd03b0858e59
runc version: v1.1.11-0-g4bccb38
init version: de40ad0
Security Options:
seccomp
Profile: unconfined
cgroupns
Kernel Version: 6.6.12-linuxkit
Operating System: Docker Desktop
OSType: linux
Architecture: aarch64
CPUs: 10
Total Memory: 11.67GiB
Name: docker-desktop
ID: 4a465bdf-a712-4327-b3db-2e9e70b6805a
Docker Root Dir: /var/lib/docker
Debug Mode: false
HTTP Proxy: http.docker.internal:3128
HTTPS Proxy: http.docker.internal:3128
No Proxy: hubproxy.docker.internal
Experimental: false
Insecure Registries:
hubproxy.docker.internal:5555
127.0.0.0/8
Live Restore Enabled: false
WARNING: daemon is not using the default seccomp profile
Diagnostics ID
24EC8920-277D-4196-9055-11FCA2E65C2B/20240229142349
Additional Info
Also raised a LocalStack issue, but I'm sure this is a Docker issue: https://github.com/localstack/localstack/issues/10340
Happy to provide any requested logs that may shed more light.
cc @djs55
Since this also affects the latest Docker version (4.28), should the version 4.28 label also be added?
Thanks for the report. It's interesting that the bug only manifests on the combination of Ventura + recent Docker Desktop. Could you ask people using Ventura to disable the "Use kernel networking for UDP" option in Settings -> Resources -> Network? There are signs in the diagnostics that the second NIC (used for UDP as an optimisation) is failing to DHCP in a loop, so perhaps this is causing the interruption.
I tried to reproduce the issue on a recent Sonoma beta and did notice something odd -- I see uploads stalling, and if I capture a packet trace with
docker run -it --net=host -v /Users/.../bug:/bug djs55/tcpdump -n -i eth0 -s 0 -w /bug/debug.pcap
and then look at the trace with Wireshark, I see LocalStack's service on port 4566 close its receive window (i.e. ask not to receive any more data, presumably because its socket buffers are full) for a few seconds and then re-open it, causing stalls and low throughput:
I'm not sure what to make of this.
Thanks, I tried with "Use kernel networking for UDP" disabled but the issue persists.
I obtained a pcap in the following way and also see TCP ZeroWindow, though it does not recover, and the ZeroWindowProbes appear to be duplicated. The ZeroWindowProbe + ZeroWindowProbeAck loop persists even after you kill the awslocal s3api put-object command:
sudo tcpdump -i any port 4566 -n -w 4566.pcap
When I lsof the ports that continue probing, I can see they are being used by Docker.
TCP DUMP
12:02:07.628530 IP6 ::1.51310 > ::1.4566: Flags [.], seq 0:1, ack 1, win 6370, options [nop,nop,TS val 3707105296 ecr 3030307405], length 1
12:02:07.628559 IP6 ::1.51310 > ::1.4566: Flags [.], seq 0:1, ack 1, win 6370, options [nop,nop,TS val 3707105296 ecr 3030307405], length 1
12:02:07.628602 IP6 ::1.4566 > ::1.51310: Flags [.], ack 0, win 0, options [nop,nop,TS val 3030312436 ecr 3707105296], length 0
12:02:07.628606 IP6 ::1.4566 > ::1.51310: Flags [.], ack 0, win 0, options [nop,nop,TS val 3030312436 ecr 3707105296], length 0
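For anyone triaging a similar capture, the zero-window stalls can also be counted without opening Wireshark, using its CLI tshark and the standard `tcp.analysis.zero_window` display filter. This is a hedged helper sketch (the function name is my own); it assumes tshark is installed.

```shell
# Hypothetical helper: count TCP zero-window segments in a capture file,
# using Wireshark's tshark CLI and its tcp.analysis.zero_window display filter.
count_zero_windows() {
  local pcap=$1
  if ! command -v tshark >/dev/null 2>&1; then
    echo "tshark not installed" >&2
    return 1
  fi
  # -r reads a saved capture; -Y applies a display filter; wc -l counts matches
  tshark -r "$pcap" -Y "tcp.analysis.zero_window" | wc -l
}

# Example: count_zero_windows 4566.pcap
```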
lsof
❯ lsof -i tcp:51310
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
com.docke 85642 mgrundie 269u IPv6 0xf9fbd45a05632263 0t0 TCP localhost:kwtc->localhost:51310 (ESTABLISHED)
❯ lsof -i tcp:51292
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
com.docke 85642 mgrundie 266u IPv6 0xf9fbd45a05630a63 0t0 TCP localhost:kwtc->localhost:51292 (ESTABLISHED)
ps
❯ ps -p 85642 -o command
COMMAND
/Applications/Docker.app/Contents/MacOS/com.docker.backend
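The lsof/ps steps above can be collapsed into one helper for checking several stuck ports at once. This is a hedged sketch (the function name `port_owner` is my own, and the port numbers from the session above will differ on each run); it assumes lsof is available.

```shell
# Hypothetical helper: report which process owns a local TCP port
# (used above to show com.docker.backend holding the probing sockets).
port_owner() {
  # -n/-P: skip host and port name resolution for faster, unambiguous output
  lsof -nP -i tcp:"$1" 2>/dev/null | awk 'NR>1 {print $1, $2}' | sort -u
}

# Example: check each port that keeps probing
# for port in 51310 51292; do port_owner "$port"; done
```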
Sorry for the delay replying. I've managed to reproduce a similar problem (hopefully the same one) and I've got a prototype fix if you'd like to try it:
With this build I can leave a large transfer running and it doesn't stall for me.
@djs55 I ran the loop for 500 iterations using your build on macOS Ventura 13.6.6 (22G630), and this is what happened at iteration 349:
...
# 348
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.36s user 0.17s system 66% cpu 0.796 total
# 349
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat 0.44s user 0.22s system 0% cpu 3:07.31 total
In a new terminal I ran it manually to see the output. Command took ~3 mins:
❯ awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat --body ./samplefile.dat
Connection was closed before we received a valid response from endpoint URL: "http://localhost:4566/testbucket/somekey/samplefile.dat".
I started a packet capture when the trouble started and continued it after killing the loop and see the same ZeroWindow output in Wireshark as my last post.
I repeated the above steps a second time and trouble started at iteration 123.
I would say your build has definitely improved the situation, as it previously took fewer iterations before failure; I can also now run my Cucumber test suite without it hanging due to this problem.
For completeness, I confirmed that the 500-iteration loop runs successfully with Docker 4.26.1.
Tested the Apple Silicon build only, as I do not have access to an Intel Mac anymore.
Thanks for the interesting test results!
@mgrundie-r7 I've got another experimental build, if you'd like to try it. It has a fix for a bug where the "zero window" probe fails to work and the connection stalls. I suspect the previous experimental build encountered this scenario less often, which is why it was slightly improved but not completely fixed:
If you have time to try either of those, let me know how it goes.
Using your latest build I reran the steps we discussed previously.
I ran 500 iterations of the for loop twice (1000 total) without issue. Seems like you've fixed it. :)
Thanks.
(Tested Apple Silicon build only)
Glad to hear it! (for the actual TCP fix most of the credit should go to the fine people over at google/gvisor)
Unfortunately the fix missed the release deadline for 4.29. I recommend keeping to the dev build until 4.30 (or perhaps a 4.29.1 update, if there is one).
Thanks for your help tracking this down.
Closing this issue because a fix has been released in Docker Desktop 4.30.0. See the release notes for more details.