Cilium has issues with kubevirt specifying a custom interface MAC

Is there an existing issue for this?

[X] I have searched the existing issues

What happened?

Use case: I have a k3s/cilium cluster with kubevirt. If kubevirt VMs run in bridge mode with a forced, specific MAC address, then Cilium loses track of the pod's MAC. I have to force a static MAC because otherwise the guest VMs' cloud-init is unhappy about seeing a brand new eth0 and bails out. #19789 was supposedly going to fix this in a general way, but it looks dead.

kubevirt network config:

          interfaces:
          - bridge: {}
            macAddress: ea:4b:c2:ab:2b:1e
            name: default

guest$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.100.0.168  netmask 255.255.255.255  broadcast 10.100.0.168
        inet6 fe80::e84b:c2ff:feab:2b1e  prefixlen 64  scopeid 0x20<link>
        ether ea:4b:c2:ab:2b:1e  txqueuelen 1000  (Ethernet)
        RX packets 36  bytes 4445 (4.3 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 13  bytes 1166 (1.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

guest$ tcpdump -nvve icmp
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
09:14:12.317772 ea:4b:c2:ab:2b:1e > fe:96:9e:22:f1:42, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 42818, offset 0, flags [DF], proto ICMP (1), length 84)
    10.100.0.168 > 8.8.8.8: ICMP echo request, id 2, seq 15, length 64
09:14:12.324435 fe:96:9e:22:f1:42 > ea:4c:59:c8:64:25, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 117, id 0, offset 0, flags [none], proto ICMP (1), length 84)
    8.8.8.8 > 10.100.0.168: ICMP echo reply, id 2, seq 15, length 64

You can see that the ICMP reply is addressed to ea:4c:59:c8:64:25 instead of the guest's ea:4b:c2:ab:2b:1e.
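The same comparison can be done mechanically. The following is only a minimal sketch: it assumes the `tcpdump -e` line layout shown in the capture above, and the helper name `ether_pair` is made up for illustration.

```python
import re

# One MAC address as tcpdump prints it, e.g. "ea:4b:c2:ab:2b:1e".
MAC = r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}"

# With -e, tcpdump prefixes each packet with "<src MAC> > <dst MAC>, ethertype ...".
ETHER_RE = re.compile(rf"(?P<src>{MAC}) > (?P<dst>{MAC}), ethertype", re.IGNORECASE)


def ether_pair(line):
    """Return (source MAC, destination MAC) from one `tcpdump -e` header line."""
    match = ETHER_RE.search(line)
    if match is None:
        raise ValueError(f"no Ethernet header in: {line!r}")
    return match.group("src").lower(), match.group("dst").lower()


# The two header lines from the capture in this report.
request = "09:14:12.317772 ea:4b:c2:ab:2b:1e > fe:96:9e:22:f1:42, ethertype IPv4 (0x0800), length 98:"
reply = "09:14:12.324435 fe:96:9e:22:f1:42 > ea:4c:59:c8:64:25, ethertype IPv4 (0x0800), length 98:"

guest_mac = "ea:4b:c2:ab:2b:1e"  # macAddress from the kubevirt interface config

req_src, req_dst = ether_pair(request)
rep_src, rep_dst = ether_pair(reply)

# The request leaves with the guest's MAC, but the reply targets a different one.
request_from_guest = req_src == guest_mac
reply_to_guest = rep_dst == guest_mac
print(f"request from guest MAC: {request_from_guest}, reply to guest MAC: {reply_to_guest}")
```

The gateway MAC (fe:96:9e:22:f1:42) is consistent in both directions; only the destination of the reply differs from what the guest uses.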

host$ cilium bpf endpoint list|grep 168
10.100.0.168:0                                id=3768  flags=0x0000 ifindex=171 mac=EA:4C:59:C8:64:25 nodemac=FE:96:9E:22:F1:42

I'm not sure how it'd work with endpoint delete because that one crashes:

host$ cilium bpf endpoint delete
panic: runtime error: index out of range [0] with length 0

goroutine 1 [running]:
github.com/cilium/cilium/cilium/cmd.glob..func5(0x3b047a0?, {0x40a0430, 0x0, 0x0?})
        /go/src/github.com/cilium/cilium/cilium/cmd/bpf_endpoint_delete.go:21 +0x138
github.com/spf13/cobra.(*Command).execute(0x3b047a0, {0x40a0430, 0x0, 0x0})
        /go/src/github.com/cilium/cilium/vendor/github.com/spf13/cobra/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0x3af9b20)
        /go/src/github.com/cilium/cilium/vendor/github.com/spf13/cobra/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
        /go/src/github.com/cilium/cilium/vendor/github.com/spf13/cobra/command.go:902
github.com/cilium/cilium/cilium/cmd.Execute()
        /go/src/github.com/cilium/cilium/cilium/cmd/root.go:36 +0x25
main.main()
        /go/src/github.com/cilium/cilium/cilium/main.go:15 +0x17
command terminated with exit code 2

Cilium Version

Client: 1.12.1 4c9a630 2022-08-15T16:29:39-07:00 go version go1.18.5 linux/amd64

Kernel Version

Linux h1 5.18.19 #1-NixOS SMP PREEMPT_DYNAMIC Sun Aug 21 13:18:56 UTC 2022 x86_64 GNU/Linux

Kubernetes Version

Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.4+k3s1", GitCommit:"c3f830e9b9ed8a4d9d0e2aa663b4591b923a296e", GitTreeState:"clean", BuildDate:"1970-01-01T01:01:01Z", GoVersion:"go1.18.5", Compiler:"gc", Platform:"linux/amd64"}

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Code of Conduct

[X] I agree to follow this project's Code of Conduct

Sep 17 '22 09:09 farcaller

I stumbled on #21347 and while this one seems similar, I think a persistent MAC is much less disruptive to the setup than a persistent pod IP. It's somewhat unexpected that cilium tracks the mac addresses of the pod-side interfaces on its own.

Sep 19 '22 09:09 farcaller

> If kubevirt VMs run in bridge mode with a forced, specific MAC address, then Cilium loses track of the pod's MAC. I have to force a static MAC because otherwise the guest VMs' cloud-init is unhappy about seeing a brand new eth0 and bails out.

Could you provide more details here? What is this "brand new eth0"? What do you mean by a "forced, specific MAC address"?

Could you also share a sysdump?

Sep 19 '22 10:09 pchaigno

The way kubevirt works, it bridges the pod-side interface and qemu's host-side interface. So what you see is:

tcpdump within the guest:

10:30:59.712668 ea:4b:c2:ab:2b:1e > 02:26:d6:b8:35:f2, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 44316, offset 0, flags [DF], proto ICMP (1), length 84)
    10.100.0.96 > 8.8.8.8: ICMP echo request, id 1, seq 21, length 64
10:30:59.717700 02:26:d6:b8:35:f2 > ca:8b:e4:05:ff:a9, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 117, id 0, offset 0, flags [none], proto ICMP (1), length 84)
    8.8.8.8 > 10.100.0.96: ICMP echo reply, id 1, seq 21, length 64

qemu guest:

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.100.0.96  netmask 255.255.255.255  broadcast 10.100.0.96
        inet6 fe80::e84b:c2ff:feab:2b1e  prefixlen 64  scopeid 0x20<link>
        ether ea:4b:c2:ab:2b:1e  txqueuelen 1000  (Ethernet)
        RX packets 52  bytes 11985 (11.7 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 35  bytes 3099 (3.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

pod network (nsenter into the pod):

eth0: flags=130<BROADCAST,NOARP>  mtu 1500
        inet 10.100.0.96  netmask 255.255.255.255  broadcast 0.0.0.0
        ether 92:40:8e:6a:1e:3e  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0-nic: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::49b:7bff:fe8f:e9e6  prefixlen 64  scopeid 0x20<link>
        ether 06:9b:7b:8f:e9:e6  txqueuelen 1000  (Ethernet)
        RX packets 216  bytes 58120 (56.7 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 219  bytes 20152 (19.6 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

k6t-eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 169.254.75.10  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::49b:7bff:fe8f:e9e6  prefixlen 64  scopeid 0x20<link>
        ether 06:9b:7b:8f:e9:e6  txqueuelen 0  (Ethernet)
        RX packets 46  bytes 4564 (4.4 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 9  bytes 1369 (1.3 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

tap0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::6c7d:7aff:fe84:2eef  prefixlen 64  scopeid 0x20<link>
        ether 6e:7d:7a:84:2e:ef  txqueuelen 1000  (Ethernet)
        RX packets 195  bytes 16981 (16.5 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 213  bytes 57737 (56.3 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

the bridge:

bridge name     bridge id               STP enabled     interfaces
k6t-eth0                8000.069b7b8fe9e6       no              eth0-nic
                                                        tap0

cilium's expectations:

# cilium bpf endpoint list|grep \\.96
10.100.0.96:0                                 id=256   flags=0x0000 ifindex=229 mac=CA:8B:E4:05:FF:A9 nodemac=02:26:D6:B8:35:F2
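To make the mismatch explicit, the endpoint entry can be parsed and compared with the other observations. This is a rough sketch that assumes the `cilium bpf endpoint list` line format printed above (Cilium 1.12.1); `parse_endpoint` is a hypothetical helper, not part of any Cilium tooling.

```python
import re

# Layout assumed from the output above:
# "<ip>:<port>  id=<n>  flags=<hex> ifindex=<n> mac=<MAC> nodemac=<MAC>"
ENDPOINT_RE = re.compile(
    r"^(?P<ip>\S+):(?P<port>\d+)\s+id=(?P<id>\d+)\s+flags=(?P<flags>0x[0-9a-fA-F]+)"
    r"\s+ifindex=(?P<ifindex>\d+)\s+mac=(?P<mac>\S+)\s+nodemac=(?P<nodemac>\S+)$"
)


def parse_endpoint(line):
    """Parse one endpoint line into a dict; MACs are lower-cased for comparison."""
    match = ENDPOINT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised endpoint line: {line!r}")
    return {
        "ip": match.group("ip"),
        "id": int(match.group("id")),
        "ifindex": int(match.group("ifindex")),
        "mac": match.group("mac").lower(),
        "nodemac": match.group("nodemac").lower(),
    }


line = ("10.100.0.96:0                                 id=256   flags=0x0000 "
        "ifindex=229 mac=CA:8B:E4:05:FF:A9 nodemac=02:26:D6:B8:35:F2")
endpoint = parse_endpoint(line)

guest_mac = "ea:4b:c2:ab:2b:1e"      # what the VM's eth0 actually uses
reply_dst_mac = "ca:8b:e4:05:ff:a9"  # destination MAC of the ICMP reply in the guest tcpdump

# The reply is addressed to the MAC Cilium recorded, not to the guest's MAC.
reply_uses_cilium_mac = endpoint["mac"] == reply_dst_mac
cilium_mac_is_guest_mac = endpoint["mac"] == guest_mac
print(reply_uses_cilium_mac, cilium_mac_is_guest_mac)
```

In other words, the MAC stored in the endpoint map is the one that ends up as the destination of return traffic, and it is not the MAC the guest listens on.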

Kubevirt does some non-obvious things to the network in the pod:

1. It renames eth0 to eth0-nic (note how the dummy eth0 is DOWN).
2. It changes that interface's MAC to be the same as that of a newly created bridge.
3. At this point the original Cilium-issued MAC is irrevocably lost.
4. The rest seems to be reasonably harmless.

From kubevirt's point of view, the expectation seems to be that the CNI will just do an ARP lookup, which will flow through the k6t-eth0 bridge. It works as expected from the pod's network ns:

# arping -I k6t-eth0 10.100.0.96
ARPING 10.100.0.96 from 169.254.75.10 k6t-eth0
Unicast reply from 10.100.0.96 [EA:4B:C2:AB:2B:1E]  0.876ms
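For illustration, here is a small sketch that compares what ARP reports with what Cilium has recorded. It assumes the arping reply format shown above; `arp_replies` and the `cilium_endpoints` dict are hypothetical stand-ins for the real data sources, filled in with the values from this thread.

```python
import re

# Reply line format assumed from the arping output above:
# "Unicast reply from <ip> [<MAC>]  <rtt>ms"
ARPING_RE = re.compile(r"Unicast reply from (?P<ip>\S+) \[(?P<mac>[0-9A-Fa-f:]{17})\]")


def arp_replies(output):
    """Map each replying IP to the MAC it answered with (lower-cased)."""
    return {m.group("ip"): m.group("mac").lower() for m in ARPING_RE.finditer(output)}


arping_output = """\
ARPING 10.100.0.96 from 169.254.75.10 k6t-eth0
Unicast reply from 10.100.0.96 [EA:4B:C2:AB:2B:1E]  0.876ms
"""

# MAC that Cilium recorded for the endpoint (from `cilium bpf endpoint list`).
cilium_endpoints = {"10.100.0.96": "ca:8b:e4:05:ff:a9"}

# Endpoints whose recorded MAC differs from the one ARP resolves to,
# as {ip: (recorded MAC, ARP-resolved MAC)}.
mismatches = {
    ip: (cilium_endpoints[ip], mac)
    for ip, mac in arp_replies(arping_output).items()
    if ip in cilium_endpoints and cilium_endpoints[ip] != mac
}
print(mismatches)
```

ARP resolves the pod IP to the VM's static MAC, while the endpoint map still holds the MAC originally issued for the pod interface.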

It works fine with flannel because that's just one more bridge on top. I don't quite know why it works with calico and I don't have a calico CNI box at hand to try now.

I think Cilium's issue is that it absolutely expects the pod to never change its interface's MAC. I figured out the issue with bpf endpoint delete, and even when I delete the endpoint it comes back with the same Cilium-issued MAC.

A recap

- My VM must have a stable MAC because otherwise cloud-init goes insane.
- Kubevirt can either masquerade the traffic (and then everything works as intended, but all the VMs have the same internal IP 10.0.2.2), or bridge the pod network.
- In the latter case, kubevirt expects the CNI to discover the pod's MAC with ARP. Cilium doesn't do that.

Sep 19 '22 10:09 farcaller

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

Nov 21 '22 02:11 github-actions[bot]

not stale

Nov 22 '22 11:11 farcaller

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

Jan 23 '23 02:01 github-actions[bot]

This issue has not seen any activity since it was marked stale. Closing.

Feb 07 '23 02:02 github-actions[bot]