
Duplicate packets in bridge mode

Open ruapehu15 opened this issue 1 year ago • 2 comments

Description

When I run a docker-compose setup with the default bridge network driver, the following happens. I replay a pcap file that contains a single packet, e.g. with tcpreplay:

tcpreplay -i eth0 test-dump.pcap
Statistics for network device: eth0
	Successful packets:        1
	Failed packets:            0
	Truncated packets:         0
	Retried packets (ENOBUFS): 0
	Retried packets (EAGAIN):  0

When running tcpdump at the same time, it shows duplicate packets:

root@test:/pcap# tcpdump -i eth0 sctp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
07:07:24.071132 IP 1.2.3.4.17036 > 5.6.7.8.17036: sctp (1) [DATA] (B)(E) [TSN: 105938735] [SID: 10] [SSEQ 11014] [PPID M3UA]
07:07:24.071340 IP 1.2.3.4.17036 > 5.6.7.8.17036: sctp (1) [DATA] (B)(E) [TSN: 105938735] [SID: 10] [SSEQ 11014] [PPID M3UA]

The packets shown are absolutely identical, except that the timestamps differ by about 200 microseconds.

The problem is not specific to tcpdump; it affects any program built on the libpcap library.
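One way to confirm the duplication programmatically is to look for byte-identical frames that arrive within a short window. The sketch below assumes the capture has already been reduced to (timestamp, raw bytes) pairs. The function name `find_duplicates`, the 1 ms window and the payload bytes are illustrative choices, not part of libpcap or tcpdump.

```python
from typing import Iterable, List, Tuple

Frame = Tuple[float, bytes]  # (timestamp in seconds, raw frame bytes)


def find_duplicates(frames: Iterable[Frame], window: float = 0.001) -> List[Tuple[Frame, Frame]]:
    """Return pairs of byte-identical frames seen within `window` seconds."""
    last_seen = {}  # frame bytes -> most recent Frame carrying those bytes
    pairs = []
    for frame in frames:
        ts, data = frame
        prev = last_seen.get(data)
        if prev is not None and ts - prev[0] <= window:
            pairs.append((prev, frame))
        last_seen[data] = frame
    return pairs


# Hypothetical payload; the timestamps are the seconds parts of the two
# tcpdump lines in the report (07:07:24.071132 and 07:07:24.071340).
capture = [(24.071132, b"sctp-data-chunk"), (24.071340, b"sctp-data-chunk")]
print(find_duplicates(capture))
```

A functional test could run a check like this over its captured frames to tell a reflected copy apart from a genuine retransmission, which would normally arrive much later.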

The interface statistics show that only 1 packet was transmitted and 1 received:

before tcpreplay:

root@test:/pcap# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 65535
        inet 172.80.0.3  netmask 255.255.255.0  broadcast 172.80.0.255
        ether 02:42:ac:50:00:03  txqueuelen 0  (Ethernet)
        **RX packets 7**  bytes 746 (746.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        **TX packets 0**  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

after tcpreplay:

root@test:/pcap# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 65535
        inet 172.80.0.3  netmask 255.255.255.0  broadcast 172.80.0.255
        ether 02:42:ac:50:00:03  txqueuelen 0  (Ethernet)
        **RX packets 8**  bytes 880 (880.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        **TX packets 1**  bytes 134 (134.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

The problem happens under the following conditions:

  • only on Macs with M1 or M2 processors (it does not happen on Linux or Windows).
  • only when the default network with the bridge driver is used. It does not happen when the host or macvlan driver is used.
  • happens with both a native arm Linux image and an x86_64 image running under Rosetta.

I also ran strace on tcpdump; it shows:

syscall_0xeffff7dfb910(0xeffff7dfb918, 0x10204, 0, 0, 0xeffff7db4000, 0xde) = 0xf00008a81ec0
syscall_0xeffff7dfc460(0xeffff7dfc462, 0x10204, 0, 0, 0xeffff7db3000, 0xde) = 0xf00008a81ec0
strace: [ Process PID=16 runs in 64 bit mode. ]
rt_sigreturn({mask=[TRAP BUS FPE SEGV USR2 PIPE CHLD STOP TSTP TTIN URG XCPU XFSZ VTALRM PROF WINCH IO PWR SYS RTMIN RT_1 RT_2 RT_3 RT_4 RT_5 RT_6 RT_7 RT_8 RT_9 RT_10 RT_11 RT_12 RT_13 RT_14 RT_15]}) = 4294967281
syscall_0x2cb29feb1965965(0x94ff4b9653f96b4b, 0x6e594ff53cb29fe9, 0x61cb29feb5b9653f, 0, 0xeffff7db2000, 0xde) = 0xf00008a81ec0
strace: [ Process PID=16 runs in x32 mode. ]
syscall_0x7ffffeeb3760(0x7ffffeeb42a0, 0x8b, 0x7ffffeeb8760, 0, 0x555555606e70, 0x4008:06:03.656453 IP 1.2.3.4.17036 > 5.6.7.8.17036: sctp (1) [DATA] (B)(E) [TSN: 105938735] [SID: 10] [SSEQ 11014] [PPID M3UA]
) = 0x1
syscall_0x7ffffeeb3760(0x7ffffeeb42a0, 0x8b, 0x7ffffeeb8760, 0, 0x555555606e70, 0x4008:06:03.656609 IP 1.2.3.4.17036 > 5.6.7.8.17036: sctp (1) [DATA] (B)(E) [TSN: 105938735] [SID: 10] [SSEQ 11014] [PPID M3UA]
) = 0x1

Reproduce

  1. docker run --privileged -it my-image-with-tcpdump /bin/bash
  2. docker exec -it $CONTAINER_ID sudo tcpreplay -i eth0 /pcap/ranap-single.pcap

Expected behavior

a single packet is played on the interface

docker version

Client:
 Version:           27.0.3
 API version:       1.46
 Go version:        go1.21.11
 Git commit:        7d4bcd8
 Built:             Fri Jun 28 23:59:41 2024
 OS/Arch:           darwin/arm64
 Context:           desktop-linux

Server: Docker Desktop 4.32.0 (157355)
 Engine:
  Version:          27.0.3
  API version:      1.46 (minimum version 1.24)
  Go version:       go1.21.11
  Git commit:       662f78c
  Built:            Sat Jun 29 00:02:44 2024
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          1.7.18
  GitCommit:        ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
 runc:
  Version:          1.1.13
  GitCommit:        v1.1.13-0-g58aa920
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client:
 Version:    27.0.3
 Context:    desktop-linux
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.15.1-desktop.1
    Path:     /Users/Steve/.docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.28.1-desktop.1
    Path:     /Users/Steve/.docker/cli-plugins/docker-compose
  debug: Get a shell into any image or container (Docker Inc.)
    Version:  0.0.32
    Path:     /Users/Steve/.docker/cli-plugins/docker-debug
  desktop: Docker Desktop commands (Alpha) (Docker Inc.)
    Version:  v0.0.14
    Path:     /Users/Steve/.docker/cli-plugins/docker-desktop
  dev: Docker Dev Environments (Docker Inc.)
    Version:  v0.1.2
    Path:     /Users/Steve/.docker/cli-plugins/docker-dev
  extension: Manages Docker extensions (Docker Inc.)
    Version:  v0.2.25
    Path:     /Users/Steve/.docker/cli-plugins/docker-extension
  feedback: Provide feedback, right in your terminal! (Docker Inc.)
    Version:  v1.0.5
    Path:     /Users/Steve/.docker/cli-plugins/docker-feedback
  init: Creates Docker-related starter files for your project (Docker Inc.)
    Version:  v1.3.0
    Path:     /Users/Steve/.docker/cli-plugins/docker-init
  sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc.)
    Version:  0.6.0
    Path:     /Users/Steve/.docker/cli-plugins/docker-sbom
  scout: Docker Scout (Docker Inc.)
    Version:  v1.10.0
    Path:     /Users/Steve/.docker/cli-plugins/docker-scout

Server:
 Containers: 93
  Running: 2
  Paused: 0
  Stopped: 91
 Images: 25
 Server Version: 27.0.3
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
 runc version: v1.1.13-0-g58aa920
 init version: de40ad0
 Security Options:
  seccomp
   Profile: unconfined
  cgroupns
 Kernel Version: 6.6.32-linuxkit
 Operating System: Docker Desktop
 OSType: linux
 Architecture: aarch64
 CPUs: 12
 Total Memory: 7.657GiB
 Name: docker-desktop
 ID: 2f557c69-e2d8-4116-ae42-9f59fe132ebf
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 85
  Goroutines: 112
  System Time: 2024-07-24T07:50:46.862436759Z
  EventsListeners: 14
 HTTP Proxy: http.docker.internal:3128
 HTTPS Proxy: http.docker.internal:3128
 No Proxy: hubproxy.docker.internal
 Labels:
  com.docker.desktop.address=unix:///Users/Steve/Library/Containers/com.docker.docker/Data/docker-cli.sock
 Experimental: false
 Insecure Registries:
  hubproxy.docker.internal:5555
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: daemon is not using the default seccomp profile

Diagnostics ID

F0938E5A-F875-40C2-8E5B-FFF66ED4852D/20240724075446

Additional Info

No response

ruapehu15 avatar Jul 24 '24 07:07 ruapehu15

Hi @ruapehu15,

Thanks for reporting. I can't tell what's happening yet, so I'll need more details to get a clear picture of the situation.

Could you share your test-dump.pcap please? Or at least share the following details:

  • What are the source and dest IP addresses, and the src / dest MAC addresses?
    • Is the dest IP address connected to the same bridge interface?
    • If not, is the dest MAC address the one assigned to the gateway?

Even if you share that pcap file, could you answer these questions please and/or provide a pcap dump taken from the bridge:

  • Is there any ARP request sent beforehand to resolve the target / next hop? (either by the current container, or by the bridge interface to discover the port associated with the gateway's MAC address)
  • A copy of docker network inspect ... for the network your container is connected to.
  • Is your container's interface put in promiscuous mode?

Could you also try fixing your strace output please? Unfortunately, it's not really readable right now as symbols aren't resolved.

Finally, I think it'd be worth trying to run cilium/pwru to see what kernel functions your packet(s) traverse. That'd really help me understand where the kernel is duplicating your packet.

I built albinkerouanton006/pwru a week ago to run it on arm64. You can probably use it that way (you may need to tweak the pcap-filter argument):

$ docker run --privileged --rm -t --pid=host -v /sys/kernel/debug/:/sys/kernel/debug/ \
    albinkerouanton006/pwru pwru --output-tuple 'sctp port 51011'

akerouanton avatar Jul 26 '24 10:07 akerouanton

Hi Albin! thanks for helping with the case. I've re-run the test with a different pcap file, 'test-dump1.pcap', which contains a single TCP SYN packet. Have attached it here.

tcpdump inside the container shows:

root@34397420aa26:/pcap# tcpdump -nn -i eth0 tcp port 8443
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:48:56.736330 IP 192.168.50.5.55919 > 192.168.50.1.8443: Flags [S], seq 2775783340, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 3331476469 ecr 0,sackOK,eol], length 0
12:48:56.737173 IP 192.168.50.5.55919 > 192.168.50.1.8443: Flags [S], seq 2775783340, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 3331476469 ecr 0,sackOK,eol], length 0

To answer your questions:

  • Source IP, dest IP, and MAC addresses are not linked to the container interface in any way. tcpreplay simply plays Ethernet frames over the interface, so these should not really matter here. I am using it this way: tcpreplay plays a test pcap dump file while the application listens on the same interface; the application then parses the traffic and compares it with an expected result (i.e. it's a functional test). So the pcap might have been collected on a different host altogether.
  • There is NO ARP request happening at the time. Again, this follows from how tcpreplay works.
  • 'docker network inspect': I have attached the result.
  • The container interface is not in promiscuous mode.
  • I have re-run the strace; see attached.

In regards to running your pwru build, I've tried it, but it fails:

❯ docker run --privileged --rm -t --pid=host -v /sys/kernel/debug/:/sys/kernel/debug/ albinkerouanton006/pwru pwru --output-tuple 'tcp port 8443'
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
2024/07/26 13:06:37 Failed to load objects:
program kprobe_skb_by_stackid: detecting kernel version: finding vDSO memory address: no vdso address found in auxv

Also, am I supposed to run tcpreplay in the same container? If so, it is missing from your image.

docker_network_inspect_info.log test-dump1.pcap.zip strace_result.log

ruapehu15 avatar Jul 26 '24 13:07 ruapehu15

Thanks for the pcap!

I was able to run tcpreplay myself and take a 'stacktrace' with pwru in parallel. I did that both from Docker Desktop and from a stock Engine built from moby/moby@master.

From Docker Desktop:

...
0xffff0000c2258fc0 0   tcpreplay:26452  4026531840 0        vethfdbdb1c:237  0x0800 1500  64    192.168.50.5:55919->192.168.50.1:8443(tcp) br_flood
0xffff0000c2258fc0 0   tcpreplay:26452  4026531840 0        vethfdbdb1c:237  0x0800 1500  64    192.168.50.5:55919->192.168.50.1:8443(tcp) maybe_deliver
0xffff0000c2258fc0 0   tcpreplay:26452  4026531840 0        vethfdbdb1c:237  0x0800 1500  64    192.168.50.5:55919->192.168.50.1:8443(tcp) nbp_switchdev_frame_mark_tx_fwd_to_hwdom
0xffff0000c2258fc0 0   tcpreplay:26452  4026531840 0        vethfdbdb1c:237  0x0800 1500  64    192.168.50.5:55919->192.168.50.1:8443(tcp) __br_forward

From the Engine:

0xffff0000c0cad200 8   <empty>:45609    4026532550 0         veth5a08d20:18  0x0800 1500  64    192.168.50.5:55919->192.168.50.1:8443(tcp) br_flood
0xffff0000c0cad200 8   <empty>:45609    4026532550 0         veth5a08d20:18  0x0800 1500  64    192.168.50.5:55919->192.168.50.1:8443(tcp) maybe_deliver
0xffff0000c0cad200 8   <empty>:45609    4026532550 0         veth5a08d20:18  0x0800 1500  64    192.168.50.5:55919->192.168.50.1:8443(tcp) kfree_skb_reason(SKB_DROP_REASON_NOT_SPECIFIED)

I believe this kfree_skb_reason comes from here, and __br_forward is called a few lines above. That means prev is NULL in one case but not in the other. This makes me think that should_deliver (called by maybe_deliver here, in the foreach loop in br_flood) doesn't return the same value in both cases.

Now, looking at should_deliver, I see it tests a few things, including the forwarding state and the hairpin flag (see here).
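The effect of the hairpin flag on that check can be sketched with a simplified model. This is only an approximation: it keeps the hairpin and forwarding-state conditions and leaves out the other checks the real kernel function performs (VLAN egress, switchdev, port isolation). The `Port` class, its field names and the `flood` helper are illustrative, not kernel identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    name: str
    hairpin: bool
    forwarding: bool = True


def should_deliver(egress: Port, ingress: Port) -> bool:
    """Simplified check: the port must be forwarding, and must differ from
    the ingress port unless hairpin mode is enabled on it."""
    return (egress.hairpin or egress.name != ingress.name) and egress.forwarding


def flood(ports, ingress):
    """Names of the ports a flooded frame would be sent out of."""
    return [p.name for p in ports if should_deliver(p, ingress)]


# Port names taken from the two pwru traces; each bridge is modelled with
# only the container's own veth port attached.
desktop = Port("vethfdbdb1c", hairpin=True)
engine = Port("veth5a08d20", hairpin=False)

print(flood([desktop], desktop))  # ['vethfdbdb1c']: the frame goes back out of its ingress port
print(flood([engine], engine))    # []: no eligible port, so the frame is dropped
```

With hairpin on, the ingress port itself qualifies as an egress port, which would match the __br_forward call seen on Docker Desktop. With hairpin off, no port qualifies, which would match the kfree_skb_reason seen on the stock Engine.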

Using iproute2's bridge util, I see the following:

root@docker-desktop:/# bridge -details link
339: veth2f1de0f@if338: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br-95470ffb1a79 state forwarding priority 32 cost 2
    hairpin on guard off root_block off fastleave off learning on flood on mcast_flood on bcast_flood on mcast_router 1 mcast_to_unicast off neigh_suppress off vlan_tunnel off isolated off locked off

root@f48db5f55f0a:/# bridge -details link
18: veth5a08d20@if17: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br-2e29be14e207 state forwarding priority 32 cost 2
    hairpin off guard off root_block off fastleave off learning on flood on mcast_flood on bcast_flood on mcast_router 1 mcast_to_unicast off neigh_suppress off vlan_tunnel off isolated off locked off

The two interfaces have different hairpin flags, so it seems I'm on the right track. Now I need to determine why hairpin is set in one case but not in the other. I'll take another look tomorrow.
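For anyone who wants to check this on their own setup, the hairpin flag can be pulled out of the `bridge -details link` output with a small parser. This is a minimal sketch that assumes the two-line-per-port layout shown above; `hairpin_by_port` is a hypothetical helper name, and the sample text is a trimmed copy of the two outputs.

```python
import re


def hairpin_by_port(bridge_link_output: str) -> dict:
    """Map each bridge port name to True/False for its hairpin flag."""
    result = {}
    current = None
    for line in bridge_link_output.splitlines():
        # Header lines look like "339: veth2f1de0f@if338: <...> ..."
        header = re.match(r"^\d+:\s+([^:@\s]+)", line)
        if header:
            current = header.group(1)
            continue
        # Detail lines are indented and contain "hairpin on" or "hairpin off"
        flag = re.search(r"\bhairpin (on|off)\b", line)
        if flag and current is not None:
            result[current] = flag.group(1) == "on"
    return result


# Trimmed copy of the two outputs above (remaining flags omitted).
sample = """\
339: veth2f1de0f@if338: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br-95470ffb1a79 state forwarding priority 32 cost 2
    hairpin on guard off root_block off fastleave off learning on flood on
18: veth5a08d20@if17: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br-2e29be14e207 state forwarding priority 32 cost 2
    hairpin off guard off root_block off fastleave off learning on flood on
"""

print(hairpin_by_port(sample))
```

In real use the input could come from running `bridge -details link` through `subprocess.run(..., capture_output=True, text=True)` in the namespace that owns the bridge.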

akerouanton avatar Jul 31 '24 19:07 akerouanton

Heh, just remembered we do have something related to hairpinning in the Engine. A quick search gives a promising answer: https://github.com/moby/moby/blob/a43ed47441d57620d60dd450a037173bfab19703/libnetwork/drivers/bridge/bridge_linux.go#L1134-L1139

Docker Desktop doesn't have a 'userland proxy' (because it's not needed), so hairpinning is automatically enabled for bridge interfaces.

I'm pretty sure br_flood is called only because the bridge doesn't know which port is associated with the dest MAC address. I think you'd need to manually update it (with your container's MAC address) before sending that frame to make sure you won't get a duplicate.

Or maybe try to create a tun/tap device and run your tests using that interface to be fully decoupled from docker's managed interfaces.

akerouanton avatar Jul 31 '24 19:07 akerouanton

Hi Albin! Thanks for the analysis. Unfortunately, in my case I won't be able to manipulate MAC addresses, due to the way the development environment is set up. What's not clear to me, though, is why the problem does NOT happen with Docker on Linux and Windows, and only happens on Mac.

ruapehu15 avatar Aug 01 '24 07:08 ruapehu15

I guess docker-proxy isn't disabled on Desktop for Linux / Windows, so hairpin mode isn't enabled on your container's bridge interface.

Could you try adding the following to your daemon.json (Settings > Docker Engine in the GUI, or ~/.docker/daemon.json):

    "userland-proxy": true

I think it'll work but I'm not 100% confident.

akerouanton avatar Aug 06 '24 23:08 akerouanton

Indeed, the problem can be fixed on Mac by enabling the userland proxy in the Docker app -> Settings -> Docker Engine:

{
  "features": {
    "buildkit": true
  },
  "userland-proxy": true
}

I guess this will be enough. Thanks for your help.
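For scripted setups, the same setting could be merged into an existing daemon.json without touching the other keys. This is a minimal sketch, assuming the file is plain JSON at a path you pass in. `enable_userland_proxy` is a hypothetical helper, and the demo writes to a temporary file rather than a real Docker configuration.

```python
import json
import tempfile
from pathlib import Path


def enable_userland_proxy(path: Path) -> dict:
    """Set "userland-proxy": true in the JSON file at `path`, keeping other keys."""
    config = json.loads(path.read_text()) if path.exists() else {}
    config["userland-proxy"] = True
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Demo against a throwaway file that starts with the "features" block shown above.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "daemon.json"
    demo.write_text('{"features": {"buildkit": true}}')
    print(enable_userland_proxy(demo))
    print(demo.read_text())
```

The daemon has to be restarted before a changed daemon.json takes effect.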

ruapehu15 avatar Aug 08 '24 15:08 ruapehu15