Duplicate packets in bridge mode
Description
When I run my docker-compose setup with the network driver set to bridge (the default), the following happens. I replay a pcap file containing a single packet, e.g. with tcpreplay:
tcpreplay -i eth0 test-dump.pcap
Statistics for network device: eth0
Successful packets: 1
Failed packets: 0
Truncated packets: 0
Retried packets (ENOBUFS): 0
Retried packets (EAGAIN): 0
When I then run tcpdump, it shows duplicate packets:
root@test:/pcap# tcpdump -i eth0 sctp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
07:07:24.071132 IP 1.2.3.4.17036 > 5.6.7.8.17036: sctp (1) [DATA] (B)(E) [TSN: 105938735] [SID: 10] [SSEQ 11014] [PPID M3UA]
07:07:24.071340 IP 1.2.3.4.17036 > 5.6.7.8.17036: sctp (1) [DATA] (B)(E) [TSN: 105938735] [SID: 10] [SSEQ 11014] [PPID M3UA]
The packets shown are completely identical, except that the timestamps differ by a couple of hundred microseconds.
The problem is not limited to tcpdump; it affects any program written with the libpcap library.
The interface statistics, however, show that only 1 packet was transmitted and 1 received:
before tcpreplay:
root@test:/pcap# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 65535
inet 172.80.0.3 netmask 255.255.255.0 broadcast 172.80.0.255
ether 02:42:ac:50:00:03 txqueuelen 0 (Ethernet)
**RX packets 7** bytes 746 (746.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
**TX packets 0** bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
after tcpreplay:
root@test:/pcap# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 65535
inet 172.80.0.3 netmask 255.255.255.0 broadcast 172.80.0.255
ether 02:42:ac:50:00:03 txqueuelen 0 (Ethernet)
**RX packets 8** bytes 880 (880.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
**TX packets 1** bytes 134 (134.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
The problem happens under the following conditions:
- only on Macs with M1/M2 processors (it does not happen on Linux or Windows).
- only when the default network with the bridge driver is used; it does not happen with the host or macvlan driver.
- it happens both with a native arm64 Linux image and with an x86_64 image running under Rosetta.
I also ran strace on tcpdump; it shows:
syscall_0xeffff7dfb910(0xeffff7dfb918, 0x10204, 0, 0, 0xeffff7db4000, 0xde) = 0xf00008a81ec0
syscall_0xeffff7dfc460(0xeffff7dfc462, 0x10204, 0, 0, 0xeffff7db3000, 0xde) = 0xf00008a81ec0
strace: [ Process PID=16 runs in 64 bit mode. ]
rt_sigreturn({mask=[TRAP BUS FPE SEGV USR2 PIPE CHLD STOP TSTP TTIN URG XCPU XFSZ VTALRM PROF WINCH IO PWR SYS RTMIN RT_1 RT_2 RT_3 RT_4 RT_5 RT_6 RT_7 RT_8 RT_9 RT_10 RT_11 RT_12 RT_13 RT_14 RT_15]}) = 4294967281
syscall_0x2cb29feb1965965(0x94ff4b9653f96b4b, 0x6e594ff53cb29fe9, 0x61cb29feb5b9653f, 0, 0xeffff7db2000, 0xde) = 0xf00008a81ec0
strace: [ Process PID=16 runs in x32 mode. ]
syscall_0x7ffffeeb3760(0x7ffffeeb42a0, 0x8b, 0x7ffffeeb8760, 0, 0x555555606e70, 0x4008:06:03.656453 IP 1.2.3.4.17036 > 5.6.7.8.17036: sctp (1) [DATA] (B)(E) [TSN: 105938735] [SID: 10] [SSEQ 11014] [PPID M3UA]
) = 0x1
syscall_0x7ffffeeb3760(0x7ffffeeb42a0, 0x8b, 0x7ffffeeb8760, 0, 0x555555606e70, 0x4008:06:03.656609 IP 1.2.3.4.17036 > 5.6.7.8.17036: sctp (1) [DATA] (B)(E) [TSN: 105938735] [SID: 10] [SSEQ 11014] [PPID M3UA]
) = 0x1
Reproduce
- docker run --privileged -it my-image-with-tcpdump /bin/bash
- docker exec -it $CONTAINER_ID sudo tcpreplay -i eth0 /pcap/ranap-single.pcap
Expected behavior
A single packet is played on the interface and captured exactly once.
docker version
Client:
Version: 27.0.3
API version: 1.46
Go version: go1.21.11
Git commit: 7d4bcd8
Built: Fri Jun 28 23:59:41 2024
OS/Arch: darwin/arm64
Context: desktop-linux
Server: Docker Desktop 4.32.0 (157355)
Engine:
Version: 27.0.3
API version: 1.46 (minimum version 1.24)
Go version: go1.21.11
Git commit: 662f78c
Built: Sat Jun 29 00:02:44 2024
OS/Arch: linux/arm64
Experimental: false
containerd:
Version: 1.7.18
GitCommit: ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
runc:
Version: 1.1.13
GitCommit: v1.1.13-0-g58aa920
docker-init:
Version: 0.19.0
GitCommit: de40ad0
docker info
Client:
Version: 27.0.3
Context: desktop-linux
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.15.1-desktop.1
Path: /Users/Steve/.docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.28.1-desktop.1
Path: /Users/Steve/.docker/cli-plugins/docker-compose
debug: Get a shell into any image or container (Docker Inc.)
Version: 0.0.32
Path: /Users/Steve/.docker/cli-plugins/docker-debug
desktop: Docker Desktop commands (Alpha) (Docker Inc.)
Version: v0.0.14
Path: /Users/Steve/.docker/cli-plugins/docker-desktop
dev: Docker Dev Environments (Docker Inc.)
Version: v0.1.2
Path: /Users/Steve/.docker/cli-plugins/docker-dev
extension: Manages Docker extensions (Docker Inc.)
Version: v0.2.25
Path: /Users/Steve/.docker/cli-plugins/docker-extension
feedback: Provide feedback, right in your terminal! (Docker Inc.)
Version: v1.0.5
Path: /Users/Steve/.docker/cli-plugins/docker-feedback
init: Creates Docker-related starter files for your project (Docker Inc.)
Version: v1.3.0
Path: /Users/Steve/.docker/cli-plugins/docker-init
sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc.)
Version: 0.6.0
Path: /Users/Steve/.docker/cli-plugins/docker-sbom
scout: Docker Scout (Docker Inc.)
Version: v1.10.0
Path: /Users/Steve/.docker/cli-plugins/docker-scout
Server:
Containers: 93
Running: 2
Paused: 0
Stopped: 91
Images: 25
Server Version: 27.0.3
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
runc version: v1.1.13-0-g58aa920
init version: de40ad0
Security Options:
seccomp
Profile: unconfined
cgroupns
Kernel Version: 6.6.32-linuxkit
Operating System: Docker Desktop
OSType: linux
Architecture: aarch64
CPUs: 12
Total Memory: 7.657GiB
Name: docker-desktop
ID: 2f557c69-e2d8-4116-ae42-9f59fe132ebf
Docker Root Dir: /var/lib/docker
Debug Mode: true
File Descriptors: 85
Goroutines: 112
System Time: 2024-07-24T07:50:46.862436759Z
EventsListeners: 14
HTTP Proxy: http.docker.internal:3128
HTTPS Proxy: http.docker.internal:3128
No Proxy: hubproxy.docker.internal
Labels:
com.docker.desktop.address=unix:///Users/Steve/Library/Containers/com.docker.docker/Data/docker-cli.sock
Experimental: false
Insecure Registries:
hubproxy.docker.internal:5555
127.0.0.0/8
Live Restore Enabled: false
WARNING: daemon is not using the default seccomp profile
Diagnostics ID
F0938E5A-F875-40C2-8E5B-FFF66ED4852D/20240724075446
Additional Info
No response
Hi @ruapehu15,
Thanks for reporting. I can't tell what's happening right now and I'll need more details to get a clear picture of the situation.
Could you share your test-dump.pcap please? Or at least share the following details:
- What are the source and dest IP addresses, and the src/dest MAC addresses?
- Is the dest IP address connected to the same bridge interface?
- If not, is the dest MAC address the one assigned to the gateway?
Even if you share that pcap file, could you answer these questions please and/or provide a pcap dump taken from the bridge:
- Is there any ARP request sent beforehand to resolve the target / next hop? (either by the current container, or by the bridge interface to discover the port associated with the gateway's MAC address)
- A copy of docker network inspect ... for the network your container is connected to.
- Is your container's interface put in promiscuous mode? (see the commands sketched below)
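In case it helps, here is a minimal sketch of commands that can answer the last two points (the network name placeholder is an assumption; adjust it to your setup):
$ docker network inspect <your-compose-network>          # subnet, gateway, attached containers
$ docker exec -it $CONTAINER_ID ip link show eth0        # PROMISC in the flags means promiscuous mode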
Could you also try fixing your strace output please? Unfortunately, it's not really readable right now as symbols aren't resolved.
Finally, I think it'd be worth trying to run cilium/pwru to see what kernel functions your packet(s) traverse. That'd really help me understand where the kernel is duplicating your packet.
I built albinkerouanton006/pwru a week ago so it can run on arm64. You can probably use it that way (maybe you'll need to tweak the pcap-filter argument):
$ docker run --privileged --rm -t --pid=host -v /sys/kernel/debug/:/sys/kernel/debug/ \
albinkerouanton006/pwru pwru --output-tuple 'sctp port 51011'
Hi Albin! Thanks for helping with the case. I've re-run the test with a different pcap file, test-dump1.pcap, which contains a single TCP SYN packet. I've attached it here.
tcpdump inside the container shows:
root@34397420aa26:/pcap# tcpdump -nn -i eth0 tcp port 8443
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:48:56.736330 IP 192.168.50.5.55919 > 192.168.50.1.8443: Flags [S], seq 2775783340, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 3331476469 ecr 0,sackOK,eol], length 0
12:48:56.737173 IP 192.168.50.5.55919 > 192.168.50.1.8443: Flags [S], seq 2775783340, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 3331476469 ecr 0,sackOK,eol], length 0
To answer your questions:
- The source IP, dest IP, and MAC addresses are not linked to the container interface in any way. The way tcpreplay works, it simply plays Ethernet frames onto the interface, so these shouldn't really matter here. The pcap might have been collected on a different host altogether.
- I'm using it like this: tcpreplay plays a test pcap file while the application listens on the same interface at the same time; the application then parses the traffic and compares it with the expected result (i.e. it's a functional test).
- There is NO ARP request happening at that time; again, that follows from how tcpreplay works (a simple way to double-check this is sketched below).
- docker network inspect: I've attached the result.
- The container interface is not in promiscuous mode.
- I've re-run strace, see attached.
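(For reference, an easy way to confirm the ARP point is an ARP-only capture in a second shell during the replay; it should stay silent. A minimal sketch:)
root@34397420aa26:/pcap# tcpdump -nn -i eth0 arp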
In regards to running your pwru build, I've tried it, but it fails:
❯ docker run --privileged --rm -t --pid=host -v /sys/kernel/debug/:/sys/kernel/debug/ albinkerouanton006/pwru pwru --output-tuple 'tcp port 8443'
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
2024/07/26 13:06:37 Failed to load objects:
program kprobe_skb_by_stackid: detecting kernel version: finding vDSO memory address: no vdso address found in auxv
Also, am I supposed to run tcpreplay in the same container? If so, it seems to be missing from your image.
docker_network_inspect_info.log test-dump1.pcap.zip strace_result.log
Thanks for the pcap!
I was able to run tcpreplay myself and take a 'stacktrace' with pwru in parallel. I did that both from Docker Desktop and from a stock Engine built from moby/moby@master.
From Docker Desktop:
...
0xffff0000c2258fc0 0 tcpreplay:26452 4026531840 0 vethfdbdb1c:237 0x0800 1500 64 192.168.50.5:55919->192.168.50.1:8443(tcp) br_flood
0xffff0000c2258fc0 0 tcpreplay:26452 4026531840 0 vethfdbdb1c:237 0x0800 1500 64 192.168.50.5:55919->192.168.50.1:8443(tcp) maybe_deliver
0xffff0000c2258fc0 0 tcpreplay:26452 4026531840 0 vethfdbdb1c:237 0x0800 1500 64 192.168.50.5:55919->192.168.50.1:8443(tcp) nbp_switchdev_frame_mark_tx_fwd_to_hwdom
0xffff0000c2258fc0 0 tcpreplay:26452 4026531840 0 vethfdbdb1c:237 0x0800 1500 64 192.168.50.5:55919->192.168.50.1:8443(tcp) __br_forward
From the Engine:
0xffff0000c0cad200 8 <empty>:45609 4026532550 0 veth5a08d20:18 0x0800 1500 64 192.168.50.5:55919->192.168.50.1:8443(tcp) br_flood
0xffff0000c0cad200 8 <empty>:45609 4026532550 0 veth5a08d20:18 0x0800 1500 64 192.168.50.5:55919->192.168.50.1:8443(tcp) maybe_deliver
0xffff0000c0cad200 8 <empty>:45609 4026532550 0 veth5a08d20:18 0x0800 1500 64 192.168.50.5:55919->192.168.50.1:8443(tcp) kfree_skb_reason(SKB_DROP_REASON_NOT_SPECIFIED)
I believe this kfree_skb_reason comes from here, and __br_forward is called a few lines above. That means prev is NULL in one case but not in the other. This makes me think that should_deliver (called by maybe_deliver in the foreach loop in br_flood) doesn't return the same value in both cases.
Now, looking at should_deliver, I see it tests a few things, including the forwarding state and the hairpin flag (see here).
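(For reference, those two per-port attributes can also be read directly from sysfs on the host / Docker Desktop VM side; a quick sketch, assuming a veth port name like the ones shown below:)
cat /sys/class/net/veth2f1de0f/brport/state          # 3 means forwarding
cat /sys/class/net/veth2f1de0f/brport/hairpin_mode   # 1 means hairpin enabled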
Using iproute2's bridge util, I see the following:
root@docker-desktop:/# bridge -details link
339: veth2f1de0f@if338: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br-95470ffb1a79 state forwarding priority 32 cost 2
hairpin on guard off root_block off fastleave off learning on flood on mcast_flood on bcast_flood on mcast_router 1 mcast_to_unicast off neigh_suppress off vlan_tunnel off isolated off locked off
root@f48db5f55f0a:/# bridge -details link
18: veth5a08d20@if17: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br-2e29be14e207 state forwarding priority 32 cost 2
hairpin off guard off root_block off fastleave off learning on flood on mcast_flood on bcast_flood on mcast_router 1 mcast_to_unicast off neigh_suppress off vlan_tunnel off isolated off locked off
The two interfaces have different hairpin flags, so it seems I'm on the right track. Now I need to determine why hairpin is set in one case but not the other. I'll take another look tomorrow.
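(One quick way to confirm the hairpin flag really is what causes the duplicate: turn it off manually on the Docker Desktop side and replay again. A sketch, untested, reusing the veth name from the output above:)
root@docker-desktop:/# bridge link set dev veth2f1de0f hairpin off
# re-run tcpreplay in the container; tcpdump should now show the packet only once
root@docker-desktop:/# bridge link set dev veth2f1de0f hairpin on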
Heh, just remembered we do have something related to hairpinning in the Engine. A quick search gives a promising answer: https://github.com/moby/moby/blob/a43ed47441d57620d60dd450a037173bfab19703/libnetwork/drivers/bridge/bridge_linux.go#L1134-L1139
Docker Desktop doesn't have a 'userland proxy' (because it's not needed), so hairpinning is automatically enabled for bridge interfaces.
I'm pretty sure br_flood is called only because the bridge doesn't know which port is associated with the dest MAC address. I think you'd need to manually update it (with your container's MAC address) before sending that frame to make sure you won't get a duplicate.
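(A sketch of one possible approach along these lines, not exactly what is described above: rewrite the destination MAC of the pcap to the gateway's MAC with tcprewrite, which ships with tcpreplay, so the bridge delivers the frame locally instead of flooding it. The MAC address and file names below are placeholders:)
root@34397420aa26:/pcap# tcprewrite --enet-dmac=02:42:c0:a8:32:01 --infile=test-dump1.pcap --outfile=test-dump1-gw.pcap
root@34397420aa26:/pcap# tcpreplay -i eth0 test-dump1-gw.pcap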
Or maybe try creating a tun/tap device and running your tests on that interface, to be fully decoupled from Docker's managed interfaces.
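(A minimal sketch of that alternative, untested; the device name is an assumption. Frames replayed onto a standalone tap are still visible to libpcap listeners on it, and are simply discarded since nothing is attached to its other end:)
# inside the privileged container: create a standalone tap device
ip tuntap add dev tap0 mode tap
ip link set tap0 up
# capture in one shell, replay in another; the frame should show up exactly once
tcpdump -nn -i tap0 tcp port 8443
tcpreplay -i tap0 /pcap/test-dump1.pcap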
Hi Albin! Thanks for the analysis. Unfortunately, in my case I won't be able to manipulate MAC addresses, due to the way the development environment is set up. What's not clear to me, though, is why the problem does NOT happen with Docker under Linux or Windows, and only happens on Mac?
I guess docker-proxy isn't disabled on Docker Desktop for Linux / Windows, and that prevents hairpin mode from being enabled on your container's bridge port.
Could you try adding the following to your daemon.json (Settings > Docker Engine in the GUI, or ~/.docker/daemon.json):
"userland-proxy": true
I think it'll work but I'm not 100% confident.
Indeed, the problem can be fixed on Mac by enabling the userland proxy in the Docker app -> Settings -> Docker Engine:
{
"features": {
"buildkit": true
},
"userland-proxy": true
}
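For completeness, after restarting Docker Desktop the hairpin flag on the container's veth can be re-checked from the VM's network namespace; a sketch using nicolaka/netshoot as one convenient image that ships iproute2 (the expectation now being hairpin off):
$ docker run --rm --net=host nicolaka/netshoot bridge -details link | grep hairpin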
I guess this will be enough. Thanks for your help.