
Running cilium causes kernel error: dead loop on virtual device cilium_vxlan, fix it urgently

Open darox opened this issue 1 year ago • 14 comments

Bug Report

I wanted to run Talos 1.7.5 with Cilium, but it causes this error: [screenshot of the kernel error]

Description

I tried with Cilium 1.15.7 and 1.16.0.

Cilium values are:

cluster:
  name: "prod-buero"
  id: 1
socketLB:
  enabled: true
bgpControlPlane:
  enabled: true
hubble:
  enabled: true
  relay:
    enabled: true
    prometheus:
      enabled: true
      serviceMonitor:
        enabled: true
  metrics:
    enabled: 
      - dns:query;ignoreAAAA
      - drop
      - tcp
      - flow
      - icmp
      - http
    serviceMonitor:
      enabled: true
prometheus:
  enabled: true
  serviceMonitor:
    enabled: true
operator:
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true
kubeProxyReplacement: true
securityContext:
  capabilities:
    ciliumAgent: 
      - CHOWN
      - KILL
      - NET_ADMIN
      - NET_RAW
      - IPC_LOCK
      - SYS_ADMIN
      - SYS_RESOURCE
      - DAC_OVERRIDE
      - FOWNER
      - SETGID
      - SETUID
    cleanCiliumState: 
      - SYS_ADMIN
      - SYS_RESOURCE
cgroup:
  autoMount:
    enabled: false
  hostRoot: /sys/fs/cgroup
k8sServiceHost: localhost
k8sServicePort: 7445
envoy:
  enabled: true
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true
extraConfig:
  external-dns-proxy: "true"

These are VMs running on Proxmox with the following specs: [screenshot of VM specs]

The cluster runs without any issues, but after ~24 hours the errors start to appear.

Logs

support.zip

Environment

  • Talos version:
Client:
        Tag:         v1.7.5
        SHA:         47731624
        Built:       
        Go version:  go1.22.4
        OS/Arch:     darwin/arm64
Server:
        NODE:        192.168.20.11
        Tag:         v1.7.5
        SHA:         47731624
        Built:       
        Go version:  go1.22.4
        OS/Arch:     linux/amd64
        Enabled:     RBAC
  • Kubernetes version:
Server Version: v1.29.0
  • Platform: Proxmox

darox avatar Aug 03 '24 14:08 darox

You should raise this with Cilium; why do you think it's a Talos issue?

smira avatar Aug 05 '24 12:08 smira

Here's the kernel log line:

https://github.com/torvalds/linux/blob/v6.6/net/core/dev.c#L4384-L4385

While you're welcome to file an issue on cilium/cilium, I'm not sure the underlying kernel should allow this to happen. I can appreciate that Talos Linux folks may not have direct input on this, depending on whether there is anything custom in the Linux kernels provided by the distro. Could be an upstream kernel issue.
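(For context, paraphrasing the linked source: those two lines are the net_crit_ratelimited("Dead loop on virtual device %s, fix it urgently!\n", ...) call in __dev_queue_xmit, which fires when a packet's transmit path re-enters the same device more times than the kernel's recursion limit allows. For a device like cilium_vxlan, that usually indicates encapsulated traffic being routed back into the tunnel itself.)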

joestringer avatar Aug 05 '24 17:08 joestringer

Talos runs a vanilla Linux LTS kernel, so there might be an issue in the Linux kernel as well, but figuring out why it happens will be much easier starting from Cilium. It might be a bug either way, or a misconfiguration, or an incompatibility.

We test Cilium on Talos in default configurations, but Cilium itself is really big, so testing every possible feature is not possible.

We are pretty sure that Cilium itself works, but we can't verify every possible configuration.

smira avatar Aug 05 '24 17:08 smira

I don't have this on my installation with Talos 1.7.6 and Cilium 1.16.0.

Are you using L4 load balancing over Cilium? Then don't forget the CiliumL2AnnouncementPolicy (minimal sketch below).
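
A minimal sketch of such a policy (the name is illustrative; with no selectors it applies to all LoadBalancer services on all nodes):

apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: l2-default   # illustrative name
spec:
  loadBalancerIPs: true   # announce LoadBalancer service IPs via ARP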

Syntax3rror404 avatar Aug 09 '24 03:08 Syntax3rror404

> I don't have this on my installation with Talos 1.7.6 and Cilium 1.16.0.
>
> Are you using L4 load balancing over Cilium? Then don't forget the CiliumL2AnnouncementPolicy.

No, I can re-install with default Helm values tomorrow.

darox avatar Aug 12 '24 16:08 darox

So, I reconfigured the setup to use only:

kubeProxyReplacement: true
securityContext:
  capabilities:
    ciliumAgent: 
      - CHOWN
      - KILL
      - NET_ADMIN
      - NET_RAW
      - IPC_LOCK
      - SYS_ADMIN
      - SYS_RESOURCE
      - DAC_OVERRIDE
      - FOWNER
      - SETGID
      - SETUID
    cleanCiliumState: 
      - SYS_ADMIN
      - SYS_RESOURCE
cgroup:
  autoMount:
    enabled: false
  hostRoot: /sys/fs/cgroup
k8sServiceHost: localhost
k8sServicePort: 7445

Everything worked smoothly, but as soon as I restarted Cilium, the issue started to appear.

Before restarting Cilium, I could also see this: [screenshot from 2024-08-14]

darox avatar Aug 14 '24 07:08 darox

I see that I'm not the only one facing this issue: https://github.com/aenix-io/cozystack/issues/273

darox avatar Aug 14 '24 07:08 darox

I don't have this problem with the newest Talos version. You can try to upgrade and see if that helps; an upgrade (e.g. talosctl upgrade --image ghcr.io/siderolabs/installer:<version>) also upgrades the Linux kernel itself.

Does cozystack apply some network tweaks? You could try a clean installation without that stack. Cilium should also run inside kube-system because of permissions.

I use Talos 1.7.6 with Cilium 1.16.0, with Hubble, L2 load balancing, and Cilium Cluster Mesh enabled.

Syntax3rror404 avatar Aug 14 '24 13:08 Syntax3rror404

@Syntax3rror404 I already tested on the latest Talos and it also happened there. Are you also running on Proxmox?

darox avatar Aug 14 '24 14:08 darox

I use it on vSphere but want to migrate to bare metal :D vSphere is also a hypervisor, so that should make no difference to your setup, and Talos is just built on top of a Linux kernel. One thing you can do is add the qemu-guest-agent as a system extension, but that doesn't change anything about the network.
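
For reference, a sketch of adding that extension via an Image Factory schematic (assuming you build your installer image through factory.talos.dev):

customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/qemu-guest-agent   # QEMU guest agent for Proxmox/KVM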

Again, you can try to install it without this cozy thing. I use Rancher on top of Talos and also don't see this issue in the logs.

Syntax3rror404 avatar Aug 14 '24 16:08 Syntax3rror404

My installation is without cozy. I just wanted to highlight that I'm not the only one facing this issue.

darox avatar Aug 15 '24 12:08 darox

What makes you sure that Talos is the main factor in this issue? I, for example, use far more features than you and don't have any issues. Given that Talos is just a vanilla kernel, it is extremely unlikely that Talos itself is the root cause. Look at your network, network devices, and configuration, and try it on another machine, for example.

Syntax3rror404 avatar Aug 16 '24 22:08 Syntax3rror404

> @Syntax3rror404 I already tested on the latest Talos and it also happened there. Are you also running on Proxmox?

I'm also running Talos 1.7.6 and Cilium 1.16 on top of Proxmox, and I'm not having this issue because I'm running native routing:


kubeProxyReplacement: true
bpf:
  masquerade: true
ipam:
  mode: kubernetes
  operator:
    clusterPoolIPv4PodCIDR: "<pod network>"
ipv4NativeRoutingCIDR: "<larger network>"
autoDirectNodeRoutes: true
routingMode: native
endpointRoutes:
  enabled: true
installNoConntrackIptablesRules: true
bandwidthManager:
  enabled: true
  bbr: true
enableIPv4BIGTCP: true
l2announcements:
  enabled: true
ipv4:
  enabled: true
securityContext:
  capabilities:
    ciliumAgent:
      - CHOWN
      - KILL
      - NET_ADMIN
      - NET_RAW
      - IPC_LOCK
      - SYS_ADMIN
      - SYS_RESOURCE
      - DAC_OVERRIDE
      - FOWNER
      - SETUID
      - SETGID
    cleanCiliumState:
      - NET_ADMIN
      - SYS_ADMIN
      - SYS_RESOURCE
cgroup:
  hostRoot: /sys/fs/cgroup
  autoMount:
    enabled: false
# Enable Cilium Ingress Controller
ingressController:
  enabled: true
# Use KubePrism to access cluster API
k8sServiceHost: localhost
k8sServicePort: 7445
# Enable Hubble
hubble:
  metrics:
    enabled:
    - dns
    - drop
    - tcp
    - flow
    - icmp
    - http
  relay:
    enabled: true
  ui:
    enabled: true

I'll see if I can run a cluster with tunnelling enabled to replicate the issue.
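
Side note on the k8sServiceHost: localhost / k8sServicePort: 7445 pair in these values: that's KubePrism, Talos's local API-server load balancer. A minimal sketch of the machine-config side (KubePrism should already be enabled by default on port 7445 in recent Talos releases, so this is illustrative rather than required):

machine:
  features:
    kubePrism:
      enabled: true
      port: 7445   # the port referenced by k8sServicePort above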

bvansomeren avatar Aug 18 '24 19:08 bvansomeren

@bvansomeren

I'm on the latest Proxmox, running 24/7, and I'm migrating to a new bare-metal cluster; both show no Cilium issues at all. I'm not using native routing.

My Proxmox Talos cluster runs Talos 1.7.6, Proxmox 8.2.4, and Cilium 1.16.0, 24/7, with no issues at all, and it's extremely performant.

I can also post my cilium values:

### cilium-values.yaml ###
# make sure cilium uses the ipam configured in k8s
ipam:
  mode: kubernetes
bgp:
  enabled: 
hubble:
  enabled: true
  relay:
    enabled: true
  ui:
    enabled: true
# use the kubeprism k8s api lb
k8sServiceHost: localhost
k8sServicePort: 7445
# enable the cilium kube proxy
kubeProxyReplacement: true
# security context that makes talos happy
securityContext:
  capabilities: 
    ciliumAgent:
      - CHOWN
      - KILL
      - NET_ADMIN
      - NET_RAW
      - IPC_LOCK
      - SYS_ADMIN
      - SYS_RESOURCE
      - DAC_OVERRIDE
      - FOWNER
      - SETGID
      - SETUID
    cleanCiliumState:
      - NET_ADMIN
      - SYS_ADMIN
      - SYS_RESOURCE
cgroup:
  autoMount:
    enabled: false
  hostRoot: /sys/fs/cgroup
# Enable the cilium l2 loadbalancer and the envoy ingress controller
l2announcements:
  enabled: true
externalIPs:
  enabled: true
ingressController:
  enabled: true
  # Create default secret before 
  # defaultSecretName: 
  # defaultSecretNamespace: cattle-system
  service:
    name: cilium-ingress
    labels: 
      l2: active
    type: LoadBalancer
    loadBalancerIP: 192.168.35.70 # Free IP in my network for the envoy ingress controller

I can also share my L2 announcement policy and IP pool:

# cilium-lb.yaml
apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
  name: "network35"
spec:
  blocks:
  - cidr: "192.168.35.0/24" # Subnet cilium announcing
  serviceSelector:
    matchLabels:
      l2: active # Useful if you have more IP pools
---
apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: policyn35
spec:
  externalIPs: true
  loadBalancerIPs: true
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/control-plane
      operator: DoesNotExist
  interfaces:
  - ^eth[0-9]+          # Matches interfaces like eth0, eth1, etc.
  - ^enp[0-9]+s[0-9]+   # Matches interfaces like enp3s0, enp4s1, etc.
  - ^ens[0-9]+          # Matches interfaces like ens3, ens4, etc.
  - ^eno[0-9]+          # Matches interfaces like eno1, eno2, etc.
  - ^enx[0-9a-fA-F]+    # Matches interfaces like enx001122334455
  - ^wlan[0-9]+         # Matches interfaces like wlan0, wlan1, etc.
  ###- ^veth[0-9a-zA-Z]+   # Matches virtual ethernet interfaces like veth0, veth1, etc.
  serviceSelector:
    matchLabels:
      l2: active

Again, the fact that Talos runs on a vanilla Linux kernel makes a Talos-related bug extremely unlikely.

Syntax3rror404 avatar Aug 18 '24 21:08 Syntax3rror404

Closing this ticket. I could fix it by tweaking the control plane and machine configs:

cluster:
  network:
    cni:
      name: none
    serviceSubnets:
      - 172.0.0.0/16
  proxy:
    disabled: true
  controlPlane:
    endpoint: https://api.local
  allowSchedulingOnControlPlanes: false
  etcd:
    advertisedSubnets:
      - 192.168.20.0/24
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 192.168.20.0/24
    clusterDNS:
      - 172.0.0.10
  install:
    disk: ${install_disk}
  network:
    hostname: ${hostname}
  sysctls:
    net.ipv6.conf.all.disable_ipv6: 1
    net.ipv6.conf.default.disable_ipv6: 1
    net.ipv6.conf.lo.disable_ipv6: 1

I don't know which of these settings contributed to it now working.
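
Note for anyone comparing configs: cni.name: none together with proxy.disabled: true is what the Talos docs call for when running Cilium with kubeProxyReplacement: true, so if either was missing before, that's a plausible culprit. Isolated from the full config above:

cluster:
  network:
    cni:
      name: none      # let Cilium be the CNI instead of the default
  proxy:
    disabled: true    # Cilium's kube-proxy replacement takes over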

darox avatar Oct 23 '24 13:10 darox

@darox I was having your exact issue. In my case, it went away after setting the IPAM mode to "kubernetes" in Cilium.
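
That is, in the Helm values (the same key shown in the configs above):

ipam:
  mode: kubernetes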

lfornili avatar Oct 29 '24 10:10 lfornili