Cilium has issues with kubevirt specifying a custom interface MAC
What happened?
Use case: I have a k3s/cilium cluster with kubevirt. If kubevirt VMs run in bridge mode with a forced, specific MAC address, then cilium loses track of the pod's MAC. I have to force a static MAC because otherwise the guest VMs' cloud-init is unhappy about seeing a brand new eth0 and bails out. #19789 was supposedly going to fix this in a general way, but it looks dead.
kubevirt network config:
interfaces:
- bridge: {}
  macAddress: ea:4b:c2:ab:2b:1e
  name: default
guest$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.100.0.168 netmask 255.255.255.255 broadcast 10.100.0.168
inet6 fe80::e84b:c2ff:feab:2b1e prefixlen 64 scopeid 0x20<link>
ether ea:4b:c2:ab:2b:1e txqueuelen 1000 (Ethernet)
RX packets 36 bytes 4445 (4.3 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 13 bytes 1166 (1.1 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
guest$ tcpdump -nvve icmp
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
09:14:12.317772 ea:4b:c2:ab:2b:1e > fe:96:9e:22:f1:42, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 42818, offset 0, flags [DF], proto ICMP (1), length 84)
10.100.0.168 > 8.8.8.8: ICMP echo request, id 2, seq 15, length 64
09:14:12.324435 fe:96:9e:22:f1:42 > ea:4c:59:c8:64:25, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 117, id 0, offset 0, flags [none], proto ICMP (1), length 84)
8.8.8.8 > 10.100.0.168: ICMP echo reply, id 2, seq 15, length 64
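You can see that the returned ICMP echo reply is addressed to ea:4c:59:c8:64:25 instead of the guest's ea:4b:c2:ab:2b:1e. That destination is the MAC cilium has on record for the endpoint: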
host$ cilium bpf endpoint list|grep 168
10.100.0.168:0 id=3768 flags=0x0000 ifindex=171 mac=EA:4C:59:C8:64:25 nodemac=FE:96:9E:22:F1:42
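To make the mismatch concrete, here is a minimal Python sketch that pulls the MACs out of the two outputs above and compares them. The helper names (endpoint_macs, frame_macs) and the regex are my own illustration, not part of cilium or tcpdump; the input lines are copied from this report, with the tcpdump line truncated after the frame length.

```python
import re

# Lines copied from the report above (tcpdump line truncated after "length 98:").
ENDPOINT_LINE = ("10.100.0.168:0 id=3768 flags=0x0000 ifindex=171 "
                 "mac=EA:4C:59:C8:64:25 nodemac=FE:96:9E:22:F1:42")
REPLY_LINE = ("09:14:12.324435 fe:96:9e:22:f1:42 > ea:4c:59:c8:64:25, "
              "ethertype IPv4 (0x0800), length 98:")
GUEST_MAC = "ea:4b:c2:ab:2b:1e"  # macAddress from the kubevirt interface config

MAC = r"(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}"


def endpoint_macs(line):
    """Return (mac, nodemac), lowercased, from one `cilium bpf endpoint list` row."""
    m = re.search(rf"\bmac=({MAC})\s+nodemac=({MAC})", line)
    if m is None:
        raise ValueError("no mac=/nodemac= fields found")
    return m.group(1).lower(), m.group(2).lower()


def frame_macs(line):
    """Return (src, dst), lowercased, from a `tcpdump -e` header line."""
    m = re.search(rf"({MAC}) > ({MAC}),", line)
    if m is None:
        raise ValueError("no 'src > dst,' MAC pair found")
    return m.group(1).lower(), m.group(2).lower()


endpoint_mac, node_mac = endpoint_macs(ENDPOINT_LINE)
reply_src, reply_dst = frame_macs(REPLY_LINE)

# The reply is sent to the MAC stored in the endpoint map, not to the guest's MAC.
print("reply dst matches endpoint map:", reply_dst == endpoint_mac)
print("reply dst matches guest MAC:   ", reply_dst == GUEST_MAC)
```

The same comparison works for the second capture later in this thread if the three constants are swapped for the 10.100.0.96 values.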
I'm not sure whether deleting the endpoint would help, because endpoint delete crashes when run without arguments:
host$ cilium bpf endpoint delete
panic: runtime error: index out of range [0] with length 0
goroutine 1 [running]:
github.com/cilium/cilium/cilium/cmd.glob..func5(0x3b047a0?, {0x40a0430, 0x0, 0x0?})
/go/src/github.com/cilium/cilium/cilium/cmd/bpf_endpoint_delete.go:21 +0x138
github.com/spf13/cobra.(*Command).execute(0x3b047a0, {0x40a0430, 0x0, 0x0})
/go/src/github.com/cilium/cilium/vendor/github.com/spf13/cobra/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0x3af9b20)
/go/src/github.com/cilium/cilium/vendor/github.com/spf13/cobra/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
/go/src/github.com/cilium/cilium/vendor/github.com/spf13/cobra/command.go:902
github.com/cilium/cilium/cilium/cmd.Execute()
/go/src/github.com/cilium/cilium/cilium/cmd/root.go:36 +0x25
main.main()
/go/src/github.com/cilium/cilium/cilium/main.go:15 +0x17
command terminated with exit code 2
Cilium Version
Client: 1.12.1 4c9a630 2022-08-15T16:29:39-07:00 go version go1.18.5 linux/amd64
Kernel Version
Linux h1 5.18.19 #1-NixOS SMP PREEMPT_DYNAMIC Sun Aug 21 13:18:56 UTC 2022 x86_64 GNU/Linux
Kubernetes Version
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.4+k3s1", GitCommit:"c3f830e9b9ed8a4d9d0e2aa663b4591b923a296e", GitTreeState:"clean", BuildDate:"1970-01-01T01:01:01Z", GoVersion:"go1.18.5", Compiler:"gc", Platform:"linux/amd64"}
Sysdump
No response
Relevant log output
No response
Anything else?
No response
I stumbled on #21347, and while this one seems similar, I think a persistent MAC is much less disruptive to the setup than a persistent pod IP. It's somewhat unexpected that cilium tracks the MAC addresses of the pod-side interfaces on its own.
> If kubevirt VMs run in a bridge mode with a force specific mac address then cilium is lost about the pod's mac. I have to force a static MAC because otherwise the guest VMs' cloud-init is unhappy about seeing a brand new eth0 and bails out.
Could you provide more details here? What is this "brand new eth0"? What do you mean by "force specific mac address"?
Could you also share a sysdump?
The way kubevirt works, it bridges the pod-side interface with qemu's host-side interface (tap0). So what you see is:
tcpdump within the guest:
10:30:59.712668 ea:4b:c2:ab:2b:1e > 02:26:d6:b8:35:f2, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 44316, offset 0, flags [DF], proto ICMP (1), length 84)
10.100.0.96 > 8.8.8.8: ICMP echo request, id 1, seq 21, length 64
10:30:59.717700 02:26:d6:b8:35:f2 > ca:8b:e4:05:ff:a9, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 117, id 0, offset 0, flags [none], proto ICMP (1), length 84)
8.8.8.8 > 10.100.0.96: ICMP echo reply, id 1, seq 21, length 64
qemu guest:
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.100.0.96 netmask 255.255.255.255 broadcast 10.100.0.96
inet6 fe80::e84b:c2ff:feab:2b1e prefixlen 64 scopeid 0x20<link>
ether ea:4b:c2:ab:2b:1e txqueuelen 1000 (Ethernet)
RX packets 52 bytes 11985 (11.7 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 35 bytes 3099 (3.0 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
pod network (nsenter into the pod):
eth0: flags=130<BROADCAST,NOARP> mtu 1500
inet 10.100.0.96 netmask 255.255.255.255 broadcast 0.0.0.0
ether 92:40:8e:6a:1e:3e txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth0-nic: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::49b:7bff:fe8f:e9e6 prefixlen 64 scopeid 0x20<link>
ether 06:9b:7b:8f:e9:e6 txqueuelen 1000 (Ethernet)
RX packets 216 bytes 58120 (56.7 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 219 bytes 20152 (19.6 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
k6t-eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 169.254.75.10 netmask 255.255.255.255 broadcast 0.0.0.0
inet6 fe80::49b:7bff:fe8f:e9e6 prefixlen 64 scopeid 0x20<link>
ether 06:9b:7b:8f:e9:e6 txqueuelen 0 (Ethernet)
RX packets 46 bytes 4564 (4.4 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 9 bytes 1369 (1.3 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
tap0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::6c7d:7aff:fe84:2eef prefixlen 64 scopeid 0x20<link>
ether 6e:7d:7a:84:2e:ef txqueuelen 1000 (Ethernet)
RX packets 195 bytes 16981 (16.5 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 213 bytes 57737 (56.3 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
the bridge:
bridge name     bridge id               STP enabled     interfaces
k6t-eth0        8000.069b7b8fe9e6       no              eth0-nic
                                                        tap0
cilium's expectations:
# cilium bpf endpoint list|grep \.96
10.100.0.96:0 id=256 flags=0x0000 ifindex=229 mac=CA:8B:E4:05:FF:A9 nodemac=02:26:D6:B8:35:F2
Kubevirt does some non-obvious things to the network in the pod:
- it renames eth0 to eth0-nic (note that the dummy eth0 left in its place is DOWN).
- it changes that interface's MAC to be the same as that of the newly created bridge (k6t-eth0).
- at this point the original cilium-issued MAC is irrevocably lost.
- the rest seems reasonably harmless.
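The effect of those steps can be checked against the ifconfig output above. This is a minimal Python sketch with the interface-to-MAC table transcribed by hand from this comment; the owners helper is a name I made up for illustration.

```python
# MACs transcribed from the pod-netns ifconfig output in this comment.
POD_INTERFACES = {
    "eth0": "92:40:8e:6a:1e:3e",      # dummy eth0 (flags show it is not UP)
    "eth0-nic": "06:9b:7b:8f:e9:e6",  # the renamed original interface
    "k6t-eth0": "06:9b:7b:8f:e9:e6",  # kubevirt's bridge
    "tap0": "6e:7d:7a:84:2e:ef",      # qemu's host-side tap
}
GUEST_MAC = "ea:4b:c2:ab:2b:1e"   # MAC forced in the kubevirt config
CILIUM_MAC = "ca:8b:e4:05:ff:a9"  # mac= from `cilium bpf endpoint list`


def owners(mac, interfaces):
    """Return the sorted names of the interfaces that carry the given MAC."""
    wanted = mac.lower()
    return sorted(name for name, m in interfaces.items() if m.lower() == wanted)


# eth0-nic and the bridge share one MAC after kubevirt's setup.
print("bridge MAC owners:", owners(POD_INTERFACES["k6t-eth0"], POD_INTERFACES))
# No interface in the pod netns carries the MAC cilium has on record.
print("cilium MAC owners:", owners(CILIUM_MAC, POD_INTERFACES))
# The guest MAC is not on any pod interface either; it only exists behind tap0.
print("guest MAC owners: ", owners(GUEST_MAC, POD_INTERFACES))
```

So a frame addressed to the cilium-issued MAC has no interface in the pod that would accept it, and the guest's MAC is only reachable by forwarding through the bridge.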
From kubevirt's POV, the expectation seems to be that the CNI will just do an ARP lookup and the traffic will flow through the k6t-eth0 bridge. It works as expected from the pod's network ns:
# arping -I k6t-eth0 10.100.0.96
ARPING 10.100.0.96 from 169.254.75.10 k6t-eth0
Unicast reply from 10.100.0.96 [EA:4B:C2:AB:2B:1E] 0.876ms
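For reference, the lookup arping performs above boils down to one broadcast ARP request and one reply. The following Python sketch only builds and parses the frames as bytes, following the standard Ethernet/ARP layout; it does not send anything (that would need a raw socket and root), and the function names are hypothetical, not anything cilium or kubevirt provides.

```python
import socket
import struct

ETHERTYPE_ARP = 0x0806
ARP_REQUEST, ARP_REPLY = 1, 2


def mac_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def arp_request(sender_mac, sender_ip, target_ip):
    """Build a broadcast Ethernet frame asking who has target_ip."""
    eth = b"\xff" * 6 + mac_bytes(sender_mac) + struct.pack("!H", ETHERTYPE_ARP)
    # htype=1 (Ethernet), ptype=0x0800 (IPv4), hlen=6, plen=4, oper=request
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, ARP_REQUEST)
    arp += mac_bytes(sender_mac) + socket.inet_aton(sender_ip)
    arp += b"\x00" * 6 + socket.inet_aton(target_ip)  # target MAC is unknown
    return eth + arp


def arp_reply_sender_mac(frame):
    """Return the sender MAC of an ARP reply frame, or None if it is not one."""
    if len(frame) < 42 or frame[12:14] != struct.pack("!H", ETHERTYPE_ARP):
        return None
    (oper,) = struct.unpack("!H", frame[20:22])
    if oper != ARP_REPLY:
        return None
    return ":".join(f"{b:02x}" for b in frame[22:28])


# Values from the arping run above: asked from k6t-eth0 (169.254.75.10).
request = arp_request("06:9b:7b:8f:e9:e6", "169.254.75.10", "10.100.0.96")
print(len(request), request.hex())
```

Fed the reply frame from the guest, arp_reply_sender_mac would return the forced MAC (ea:4b:c2:ab:2b:1e here), which is the value the endpoint map would need to hold.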
It works fine with flannel because that's just one more bridge on top. I don't quite know why it works with calico and I don't have a calico CNI box at hand to try now.
I think the problem is that cilium expects the pod never to change its interface's MAC. I figured out the issue with bpf endpoint delete, but even when I delete the endpoint, it comes back with the same cilium-issued MAC.
A recap
- my VM must have a stable MAC, because otherwise cloud-init goes insane.
- kubevirt can either masquerade the traffic (and then everything works as intended, but all the VMs have the same internal IP 10.0.2.2), or bridge the pod network.
- in the latter case, kubevirt expects the CNI to discover the pod's MAC via ARP. Cilium doesn't do that.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
not stale
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
This issue has not seen any activity since it was marked stale. Closing.