Running Cilium causes kernel error: "dead loop on virtual device cilium_vxlan, fix it urgently"
Bug Report
I wanted to run Talos 1.7.5 with Cilium, but it triggers the kernel error "dead loop on virtual device cilium_vxlan, fix it urgently".
Description
I tried with Cilium 1.15.7 and 1.16.0.
Cilium values are:
cluster:
name: "prod-buero"
id: 1
socketLB:
enabled: true
bgpControlPlane:
enabled: true
hubble:
enabled: true
relay:
enabled: true
prometheus:
enabled: true
serviceMonitor:
enabled: true
metrics:
enabled:
- dns:query;ignoreAAAA
- drop
- tcp
- flow
- icmp
- http
serviceMonitor:
enabled: true
prometheus:
enabled: true
serviceMonitor:
enabled: true
operator:
prometheus:
enabled: true
serviceMonitor:
enabled: true
kubeProxyReplacement: true
securityContext:
capabilities:
ciliumAgent:
- CHOWN
- KILL
- NET_ADMIN
- NET_RAW
- IPC_LOCK
- SYS_ADMIN
- SYS_RESOURCE
- DAC_OVERRIDE
- FOWNER
- SETGID
- SETUID
cleanCiliumState:
- SYS_ADMIN
- SYS_RESOURCE
cgroup:
autoMount:
enabled: false
hostRoot: /sys/fs/cgroup
k8sServiceHost: localhost
k8sServicePort: 7445
envoy:
enabled: true
prometheus:
enabled: true
serviceMonitor:
enabled: true
extraConfig:
external-dns-proxy: "true"
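For context, these values leave routingMode unset, so Cilium falls back to its default VXLAN tunneling mode, which is what creates the cilium_vxlan device named in the kernel error. Below is a minimal sketch of the equivalent explicit settings, assuming Cilium 1.15/1.16 defaults; these lines are for illustration only and are not part of the values above:

# Cilium 1.14+ defaults written out explicitly (illustrative only)
routingMode: tunnel
tunnelProtocol: vxlan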
These are VMs running on Proxmox with the following specs:
The cluster runs without any issues for roughly 24 hours, and then the errors start to appear.
Logs
Environment
- Talos version:
Client:
Tag: v1.7.5
SHA: 47731624
Built:
Go version: go1.22.4
OS/Arch: darwin/arm64
Server:
NODE: 192.168.20.11
Tag: v1.7.5
SHA: 47731624
Built:
Go version: go1.22.4
OS/Arch: linux/amd64
Enabled: RBAC
- Kubernetes version:
Server Version: v1.29.0
- Platform: Proxmox
You should raise this with Cilium; why do you think it's a Talos issue?
Here's where that kernel log line comes from:
https://github.com/torvalds/linux/blob/v6.6/net/core/dev.c#L4384-L4385
While you're welcome to file an issue on cilium/cilium, I'm not sure the underlying kernel should allow this to happen. I can appreciate that Talos Linux folks may not have direct input on this, depending on whether there is anything custom in the Linux kernels provided by the distro. Could be an upstream kernel issue.
Talos runs with a vanilla Linux LTS kernel, so there might be an issue in the Linux kernel as well, but even figuring out why it happens will be much easier if you start with Cilium. Either way, it might be a bug, a misconfiguration, or an incompatibility.
We test Cilium on Talos in default configurations, but Cilium itself is really big, so testing every possible feature is not possible.
We are pretty sure that Cilium itself works, but we can't verify every possible configuration.
I don't have this on my installation with Talos 1.7.6 and Cilium 1.16.0.
Are you using L4 load balancing over Cilium? Then don't forget the CiliumL2AnnouncementPolicy.
No, I can re-install with default Helm values tomorrow.
So, I reconfigured the setup to use only:
kubeProxyReplacement: true
securityContext:
capabilities:
ciliumAgent:
- CHOWN
- KILL
- NET_ADMIN
- NET_RAW
- IPC_LOCK
- SYS_ADMIN
- SYS_RESOURCE
- DAC_OVERRIDE
- FOWNER
- SETGID
- SETUID
cleanCiliumState:
- SYS_ADMIN
- SYS_RESOURCE
cgroup:
autoMount:
enabled: false
hostRoot: /sys/fs/cgroup
k8sServiceHost: localhost
k8sServicePort: 7445
Everything worked smoothly, but as soon as I restarted Cilium, the issue started to appear.
Before restarting Cilium, I could also see:
I see that I'm not the only one facing this issue: https://github.com/aenix-io/cozystack/issues/273
I don't have this problem with the newest Talos version. You can try to upgrade and see if this helps. An upgrade should also upgrade the Linux kernel itself.
Does cozystack apply some network tweaks? You can try this with a clean installation without that stack. Cilium should also run inside kube-system because of permissions.
I use Talos 1.7.6 with Cilium 1.16.0, with Hubble, L2 load balancing, and Cilium Cluster Mesh enabled.
@Syntax3rror404 I already tested on the latest Talos and it also happened there. Are you also running on Proxmox?
I use it on vSphere but want to migrate to bare metal :D vSphere is also a hypervisor, so this should make no difference to your setup, and Talos is just built on top of a Linux kernel. One thing you can do is add the qemu-guest-agent as a system extension, but this doesn't change anything about the network.
Again, you can try to install it without this cozy thing. I use Rancher on top of Talos and also don't see this issue in the logs.
My installation is without cozy. I just wanted to highlight that I'm not the only one facing this issue.
What makes you sure that Talos is the main factor in this issue? I use, for example, many more features than you do and don't have any issue. Given that Talos ships a vanilla kernel, it is extremely unlikely that Talos itself is the main cause. Look at your network, network devices, and configuration, and try it on another machine, for example.
I'm also running Talos 1.7.6 and Cilium 1.16 on top of Proxmox, and I'm not having this issue because I'm running native routing:
kubeProxyReplacement: true
bpf:
masquerade: true
ipam:
mode: kubernetes
operator:
clusterPoolIPv4PodCIDR: "<pod network>"
ipv4NativeRoutingCIDR: "<larger network>"
autoDirectNodeRoutes: true
routingMode: native
endpointRoutes:
enabled: true
installNoConntrackIptablesRules: true
bandwidthManager:
enabled: true
bbr: true
enableIPv4BIGTCP: true
l2announcements:
enabled: true
ipv4:
enabled: true
securityContext:
capabilities:
ciliumAgent:
- CHOWN
- KILL
- NET_ADMIN
- NET_RAW
- IPC_LOCK
- SYS_ADMIN
- SYS_RESOURCE
- DAC_OVERRIDE
- FOWNER
- SETUID
- SETGID
cleanCiliumState:
- NET_ADMIN
- SYS_ADMIN
- SYS_RESOURCE
cgroup:
hostRoot: /sys/fs/cgroup
autoMount:
enabled: false
# Enable Cilium Ingress Controller
ingressController:
enabled: true
# Use KubePrism to access cluster API
k8sServiceHost: localhost
k8sServicePort: 7445
# Enable Hubble
hubble:
metrics:
enabled:
- dns
- drop
- tcp
- flow
- icmp
- http
relay:
enabled: true
ui:
enabled: true
I'll see if I can run a cluster with tunnelling enabled to replicate the issue
@bvansomeren
I'm on the latest Proxmox, running 24/7, and I migrated to a new bare-metal cluster, both with no Cilium issues at all. I'm not using native routing.
My Proxmox Talos cluster runs Talos 1.7.6, Proxmox 8.2.4, and Cilium 1.16.0, 24/7 with no issues at all, and it is extremely performant.
I can also post my Cilium values:
### cilium-values.yaml###
# make sure cilium uses the ipam configured in k8s
ipam:
mode: kubernetes
bgp:
enabled:
hubble:
enabled: true
relay:
enabled: true
ui:
enabled: true
# use the kubeprism k8s api lb
k8sServiceHost: localhost
k8sServicePort: 7445
# enable the cilium kube proxy
kubeProxyReplacement: true
# security context that makes talos happy
securityContext:
capabilities:
ciliumAgent:
- CHOWN
- KILL
- NET_ADMIN
- NET_RAW
- IPC_LOCK
- SYS_ADMIN
- SYS_RESOURCE
- DAC_OVERRIDE
- FOWNER
- SETGID
- SETUID
cleanCiliumState:
- NET_ADMIN
- SYS_ADMIN
- SYS_RESOURCE
cgroup:
autoMount:
enabled: false
hostRoot: /sys/fs/cgroup
# Enable the cilium l2 loadbalancer and the envoy ingress controller
l2announcements:
enabled: true
ingressController:
enabled: true
externalIPs:
enabled: true
ingressController:
enabled: true
# Create default secret before
# defaultSecretName:
# defaultSecretNamespace: cattle-system
service:
name: cilium-ingress
labels:
l2: active
type: LoadBalancer
loadBalancerIP : 192.168.35.70 # Free IP in my network for the envoy ingress controller
I can also share my L2 announcement policy and load balancer IP pool:
# cilium-lb.yaml
apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
name: "network35"
spec:
blocks:
- cidr: "192.168.35.0/24" # Subnet cilium announcing
serviceSelector:
matchLabels:
l2: active # Useful if you have more IP pools
---
apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
name: policyn35
spec:
externalIPs: true
loadBalancerIPs: true
nodeSelector:
matchExpressions:
- key: node-role.kubernetes.io/control-plane
operator: DoesNotExist
interfaces:
- ^eth[0-9]+ # Matches interfaces like eth0, eth1, etc.
- ^enp[0-9]+s[0-9]+ # Matches interfaces like enp3s0, enp4s1, etc.
- ^ens[0-9]+ # Matches interfaces like ens3, ens4, etc.
- ^eno[0-9]+ # Matches interfaces like eno1, eno2, etc.
- ^enx[0-9a-fA-F]+ # Matches interfaces like enx001122334455
- ^wlan[0-9]+ # Matches interfaces like wlan0, wlan1, etc.
###- ^veth[0-9a-zA-Z]+ # Matches virtual ethernet interfaces like veth0, veth1, etc.
serviceSelector:
matchLabels:
l2: active
Again, the fact that Talos runs on a vanilla Linux kernel makes a Talos-related bug extremely unlikely.
Closing this ticket. I could fix it by tweaking the control-plane and machine configs:
cluster:
network:
cni:
name: none
serviceSubnets:
- 172.0.0.0/16
proxy:
disabled: true
controlPlane:
endpoint: https://api.local
allowSchedulingOnControlPlanes: false
etcd:
advertisedSubnets:
- 192.168.20.0/24
machine:
kubelet:
nodeIP:
validSubnets:
- 192.168.20.0/24
clusterDNS:
- 172.0.0.10
install:
disk: ${install_disk}
network:
hostname: ${hostname}
sysctls:
net.ipv6.conf.all.disable_ipv6: 1
net.ipv6.conf.default.disable_ipv6: 1
net.ipv6.conf.lo.disable_ipv6: 1
I don't know which of these settings made it work.
@darox I was having your exact issue. In my case, it went away after setting the IPAM mode to "kubernetes" in Cilium.
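For anyone hitting the same symptom, here is a minimal sketch of that Helm values change, assuming everything else stays as in the configs above (ipam.mode defaults to "cluster-pool" when unset):

# Switch Cilium IPAM to Kubernetes host-scope mode instead of the default cluster-pool
ipam:
  mode: kubernetes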