cilium-service-mesh-beta
regeneration-recovery is failing since
Is there an existing issue for this?
- [X] I have searched the existing issues
What happened?
I upgraded an existing Cilium installation in a cluster to the new service mesh images and settings. To do this I ran the following:
git clone https://github.com/cilium/cilium.git
cd cilium
git checkout origin/beta/service-mesh
helm upgrade -n kube-system cilium ./install/kubernetes/cilium --values=../<dir-path>/values.yaml
The helm upgrade reported a successful install. I then ran cilium status and got the following errors:
cilium status ✔
/¯¯\
/¯¯\__/¯¯\ Cilium: 47 errors
\__/¯¯\__/ Operator: OK
/¯¯\__/¯¯\ Hubble: OK
\__/¯¯\__/ ClusterMesh: disabled
\__/
Deployment hubble-relay Desired: 1, Ready: 1/1, Available: 1/1
Deployment hubble-ui Desired: 1, Ready: 1/1, Available: 1/1
DaemonSet cilium Desired: 6, Ready: 6/6, Available: 6/6
Deployment cilium-operator Desired: 2, Ready: 2/2, Available: 2/2
Containers: cilium Running: 6
cilium-operator Running: 2
hubble-relay Running: 1
hubble-ui Running: 1
Cluster Pods: 90/90 managed by Cilium
Image versions cilium quay.io/cilium/cilium-service-mesh:v1.11.0-beta.1: 6
cilium-operator quay.io/cilium/operator-generic-service-mesh:v1.11.0-beta.1: 2
hubble-relay quay.io/cilium/hubble-relay-service-mesh:v1.11.0-beta.1: 1
hubble-ui docker.io/envoyproxy/envoy:v1.18.4@sha256:e5c2bb2870d0e59ce917a5100311813b4ede96ce4eb0c6bfa879e3fbe3e83935: 1
hubble-ui quay.io/cilium/hubble-ui:v0.8.3@sha256:018ed122968de658d8874e2982fa6b3a8ae64b43d2356c05f977004176a89310: 1
hubble-ui quay.io/cilium/hubble-ui-backend:v0.8.3@sha256:13a16ed3ae9749682c817d3b834b2f2de901da6fb41de7753d7dce16650982b3: 1
Errors: cilium cilium-sh8rz controller endpoint-708-regeneration-recovery is failing since 2m39s (317x): regeneration recovery failed
cilium cilium-sh8rz controller endpoint-3306-regeneration-recovery is failing since 2m34s (317x): regeneration recovery failed
cilium cilium-5mnm5 controller endpoint-396-regeneration-recovery is failing since 2m44s (317x): regeneration recovery failed
cilium cilium-5mnm5 controller endpoint-3141-regeneration-recovery is failing since 2m44s (317x): regeneration recovery failed
This output continues for all of the agents, of course.
Checking the agent logs, I can see they are flooded with the following:
level=warning msg="libbpf: map 'cilium_calls_00058': error reusing pinned map" subsys=datapath-loader
level=warning msg="libbpf: map 'cilium_calls_00058': failed to create: Invalid argument(-22)" subsys=datapath-loader
level=warning msg="libbpf: failed to load object '58_next/bpf_lxc.o'" subsys=datapath-loader
level=warning msg="Unable to load program" subsys=datapath-loader
level=warning msg="JoinEP: Failed to load program" containerID=d0113b584c datapathPolicyRevision=0 desiredPolicyRevision=11 endpointID=58 error="Failed to load prog with tc: exit status 1" file-path=58_next/bpf_lxc.o identity=41106 ipv4=10.0.0.58 ipv6= k8sPodName=network-system/node-feature-discovery-worker-zdtdv subsys=datapath-loader veth=lxc30e9e94e930d
level=error msg="Error while rewriting endpoint BPF program" containerID=d0113b584c datapathPolicyRevision=0 desiredPolicyRevision=11 endpointID=58 error="Failed to load prog with tc: exit status 1" identity=41106 ipv4=10.0.0.58 ipv6= k8sPodName=network-system/node-feature-discovery-worker-zdtdv subsys=endpoint
level=warning msg="generating BPF for endpoint failed, keeping stale directory." containerID=d0113b584c datapathPolicyRevision=0 desiredPolicyRevision=11 endpointID=58 file-path=58_next_fail identity=41106 ipv4=10.0.0.58 ipv6= k8sPodName=network-system/node-feature-discovery-worker-zdtdv subsys=endpoint
level=warning msg="Regeneration of endpoint failed" bpfCompilation=0s bpfLoadProg=106.417467ms bpfWaitForELF="14.093µs" bpfWriteELF=5.40179ms containerID=d0113b584c datapathPolicyRevision=0 desiredPolicyRevision=11 endpointID=58 error="Failed to load prog with tc: exit status 1" identity=41106 ipv4=10.0.0.58 ipv6= k8sPodName=network-system/node-feature-discovery-worker-zdtdv mapSync="8.796µs" policyCalculation="48.87µs" prepareBuild=1.70903ms proxyConfiguration="28.481µs" proxyPolicyCalculation="176.405µs" proxyWaitForAck=0s reason="retrying regeneration" subsys=endpoint total=119.07863ms waitingForCTClean="1.204µs" waitingForLock="73.388µs"
level=error msg="endpoint regeneration failed" containerID=d0113b584c datapathPolicyRevision=0 desiredPolicyRevision=11 endpointID=58 error="Failed to load prog with tc: exit status 1" identity=41106 ipv4=10.0.0.58 ipv6= k8sPodName=network-system/node-feature-discovery-worker-zdtdv subsys=endpoint
level=error msg="Command execution failed" cmd="[tc filter replace dev cilium_host ingress prio 1 handle 1 bpf da obj 3037_next/bpf_host.o sec to-host]" error="exit status 1" subsys=datapath-loader
level=warning msg="libbpf: couldn't reuse pinned map at '/sys/fs/bpf/tc//globals/cilium_calls_hostns_03037': parameter mismatch" subsys=datapath-loader
level=warning msg="libbpf: map 'cilium_calls_hostns_03037': error reusing pinned map" subsys=datapath-loader
level=warning msg="libbpf: map 'cilium_calls_hostns_03037': failed to create: Invalid argument(-22)" subsys=datapath-loader
level=warning msg="libbpf: failed to load object '3037_next/bpf_host.o'" subsys=datapath-loader
level=warning msg="Unable to load program" subsys=datapath-loader
level=warning msg="JoinEP: Failed to load program for host endpoint (to-host)" containerID= datapathPolicyRevision=0 desiredPolicyRevision=11 endpointID=3037 error="Failed to load prog with tc: exit status 1" file-path=3037_next/bpf_host.o identity=1 ipv4= ipv6= k8sPodName=/ subsys=datapath-loader veth=cilium_host
level=error msg="Error while rewriting endpoint BPF program" containerID= datapathPolicyRevision=0 desiredPolicyRevision=11 endpointID=3037 error="Failed to load prog with tc: exit status 1" identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint
level=warning msg="generating BPF for endpoint failed, keeping stale directory." containerID= datapathPolicyRevision=0 desiredPolicyRevision=11 endpointID=3037 file-path=3037_next_fail identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint
level=warning msg="Regeneration of endpoint failed" bpfCompilation=0s bpfLoadProg=143.574912ms bpfWaitForELF="20.982µs" bpfWriteELF=6.394646ms containerID= datapathPolicyRevision=0 desiredPolicyRevision=11 endpointID=3037 error="Failed to load prog with tc: exit status 1" identity=1 ipv4= ipv6= k8sPodName=/ mapSync="9.888µs" policyCalculation="27.333µs" prepareBuild=2.842736ms proxyConfiguration="33.314µs" proxyPolicyCalculation="8.815µs" proxyWaitForAck=0s reason="retrying regeneration" subsys=endpoint total=161.941749ms waitingForCTClean="1.686µs" waitingForLock="3.741µs"
level=error msg="endpoint regeneration failed" containerID= datapathPolicyRevision=0 desiredPolicyRevision=11 endpointID=3037 error="Failed to load prog with tc: exit status 1" identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint
These logs are somewhat repetitive, so I've only grabbed the last few entries.
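For reference, a minimal sketch (assuming bpftool is available on the node) for inspecting one of the pinned maps that libbpf refuses to reuse; the path is taken from the warnings above:
# Show the parameters (type, key/value size, max_entries) of the pinned map from the logs.
sudo bpftool map show pinned /sys/fs/bpf/tc//globals/cilium_calls_hostns_03037
# The "parameter mismatch" warning means these parameters no longer match the map
# definition in the new bpf_host.o that the loader is trying to attach.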
Cilium Version
v1.11.0
Kernel Version
Linux k8s-controlplane-01 5.11.0-1007-raspi #7-Ubuntu SMP PREEMPT Wed Apr 14 22:08:05 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
It's worth mentioning that this is the oldest kernel version in use on any node in the fleet.
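In case it's useful, here is how the kernel version of every node can be listed with kubectl (nothing Cilium-specific here):
# List each node's kernel version as reported by the kubelet.
kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion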
Kubernetes Version
v1.22.2
Sysdump
ip
Relevant log output
No response
Anything else?
I started by doing a helm upgrade ..., then performed an uninstall using both Helm and the CLI to ensure everything was removed, and attempted a fresh install; however, it ended with the same results.
I also attempted a rollback; however, it still continually hits this issue and never fully starts. I'm not sure what the actual impact of these logs is, as services continue to start and remain routable through the previous methods.
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Thanks for the report! Could you please share a Cilium sysdump?
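For reference, a sysdump can be collected with the Cilium CLI and the resulting zip archive attached here (a minimal sketch):
# Collect cluster and agent state into a zip archive in the current directory.
cilium sysdump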
I recently bumped into this with a GKE cluster; sysdump attached.
I used the CLI to install.
Kernel is 5.10, using the GKE COS.
I swapped my GKE cluster from COS to Ubuntu and have made progress.
Just hit this issue on EKS
/¯¯\__/¯¯\ Cilium: 4 errors
\__/¯¯\__/ Operator: 1 errors
/¯¯\__/¯¯\ Hubble: disabled
\__/¯¯\__/ ClusterMesh: disabled
\__/
DaemonSet cilium Desired: 2, Ready: 2/2, Available: 2/2
Deployment cilium-operator Desired: 1, Unavailable: 1/1
Containers: cilium Running: 2
cilium-operator Running: 1
Cluster Pods: 2/2 managed by Cilium
Image versions cilium quay.io/cilium/cilium-service-mesh:v1.11.0-beta.1@sha256:4252b95ce4d02f5b772fd7756d240e3c036e6c9a19e3d77bae9c3fa31c837e50: 2
cilium-operator quay.io/cilium/operator-generic-service-mesh:v1.11.0-beta.1@sha256:dcf364d807e26bc3a62fc8190e6ca40b40e9fceb71c7a934e34cbf24d5a9bfa8: 1
Errors: cilium cilium-glppv controller cilium-health-ep is failing since 25s (41x): Get "http://192.168.73.203:4240/hello": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
cilium cilium-mks64 controller endpoint-2857-regeneration-recovery is failing since 1m35s (46x): regeneration recovery failed
cilium cilium-mks64 controller endpoint-260-regeneration-recovery is failing since 1m35s (46x): regeneration recovery failed
cilium cilium-mks64 controller cilium-health-ep is failing since 24s (41x): Get "http://192.168.3.52:4240/hello": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
cilium-operator cilium-operator 1 pods of Deployment cilium-operator are not ready
Community reports suggest that deleting the pinned BPF maps on every node with sudo rm -rf /sys/fs/bpf/tc//globals/* after uninstalling and before reinstalling will help.
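A sketch of that workaround sequence, for anyone who wants to try it (the values file path is a placeholder; run the rm step on every node, e.g. over SSH):
# 1. Remove the existing installation.
helm uninstall -n kube-system cilium
# 2. On every node, clear the stale pinned eBPF maps left behind under the tc globals path.
sudo rm -rf /sys/fs/bpf/tc//globals/*
# 3. Reinstall from the service-mesh branch chart.
helm install -n kube-system cilium ./install/kubernetes/cilium --values=values.yaml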
I forgot to provide my values, as I'm installing from the Helm chart on the service-mesh branch. @pchaigno, could you vet that this looks accurate? I did have some questions about using autoDirectNodeRoutes and tunnel: disabled, but I wasn't sure whether that is appropriate for bare-metal clusters. (A small sanity-check sketch follows the values below.)
image:
  repository: quay.io/cilium/cilium-service-mesh
  tag: v1.11.0-beta.1
  useDigest: false
extraConfig:
  enable-envoy-config: "true"
# -- Enable installation of PodCIDR routes between worker
# nodes if worker nodes share a common L2 network segment.
autoDirectNodeRoutes: true
# Cilium leverages MetalLB's simplified BGP announcement system for service type: LoadBalancer
bgp:
  enabled: false
  announce:
    loadbalancerIP: true
nodePort:
  # -- Enable the Cilium NodePort service implementation.
  enabled: true
  # -- Port range to use for NodePort services.
  range: "30000,32767"
containerRuntime:
  integration: containerd
endpointRoutes:
  # -- Enable use of per endpoint routes instead of routing via
  # the cilium_host interface.
  enabled: false
# -- Enables masquerading of IPv4 traffic leaving the node from endpoints.
enableIPv4Masquerade: true
# -- Enables masquerading of IPv6 traffic leaving the node from endpoints.
enableIPv6Masquerade: true
# masquerade enables masquerading of traffic leaving the node for
# destinations outside of the cluster.
masquerade: true
hubble:
  # -- Enable Hubble (true by default).
  enabled: true
  # Enables the provided list of Hubble metrics.
  metrics:
    enabled:
      - dns:query;ignoreAAAA
      - drop
      - tcp
      - flow
      - port-distribution
      - icmp
      - http
  listenAddress: ':4244'
  relay:
    # -- Enable Hubble Relay (requires hubble.enabled=true)
    enabled: true
    image:
      repository: quay.io/cilium/hubble-relay-service-mesh
      tag: v1.11.0-beta.1
      useDigest: false
    # -- Roll out Hubble Relay pods automatically when configmap is updated.
    rollOutPods: true
  ui:
    # -- Whether to enable the Hubble UI.
    enabled: true
    # -- Roll out Hubble-ui pods automatically when configmap is updated.
    rollOutPods: true
ipam:
  # -- Configure IP Address Management mode.
  # ref: https://docs.cilium.io/en/stable/concepts/networking/ipam/
  mode: "kubernetes"
  operator:
    # -- Deprecated in favor of ipam.operator.clusterPoolIPv4PodCIDRList.
    # IPv4 CIDR range to delegate to individual nodes for IPAM.
    clusterPoolIPv4PodCIDR: "10.0.0.0/8"
    # -- IPv4 CIDR list range to delegate to individual nodes for IPAM.
    clusterPoolIPv4PodCIDRList: ["10.0.0.0/8"]
    # -- IPv4 CIDR mask size to delegate to individual nodes for IPAM.
    clusterPoolIPv4MaskSize: 24
    # -- Deprecated in favor of ipam.operator.clusterPoolIPv6PodCIDRList.
    # IPv6 CIDR range to delegate to individual nodes for IPAM.
    clusterPoolIPv6PodCIDR: "fd00::/104"
    # -- IPv6 CIDR list range to delegate to individual nodes for IPAM.
    clusterPoolIPv6PodCIDRList: ["fd00::/104"]
    # -- IPv6 CIDR mask size to delegate to individual nodes for IPAM.
    clusterPoolIPv6MaskSize: 120
ipv6:
  # -- Enable IPv6 support.
  enabled: false
# kubeProxyReplacement enables kube-proxy replacement in Cilium BPF datapath
# Disabled due to RockPi kernel <= 4.4
# Valid options are "disabled", "probe", "partial", "strict".
# ref: https://docs.cilium.io/en/stable/gettingstarted/kubeproxy-free/
kubeProxyReplacement: strict
# kubeProxyReplacement healthz server bind address
# To enable set the value to '0.0.0.0:10256' for all ipv4
# addresses and this '[::]:10256' for all ipv6 addresses.
# By default it is disabled.
# Can't be used as RockPi Kernel is <=4.4
kubeProxyReplacementHealthzBindAddr: '0.0.0.0:10256'
# prometheus enables serving metrics on the configured port at /metrics
# Enables metrics for cilium-agent.
prometheus:
  enabled: true
  port: 9090
  # This requires the prometheus CRDs to be available (see https://github.com/prometheus-operator/prometheus-operator/blob/master/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml)
  serviceMonitor:
    enabled: false
operator:
  image:
    repository: quay.io/cilium/operator
    tag: v1.11.0-beta.1
    useDigest: false
    suffix: "-service-mesh"
  # -- Roll out cilium-operator pods automatically when configmap is updated.
  rollOutPods: true
  # Enables metrics for cilium-operator.
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: false
# kubeConfigPath: ~/.kube/config
k8sServiceHost: 192.168.1.205
k8sServicePort: 6443
# -- Specify the IPv4 CIDR for native routing (ie to avoid IP masquerade for).
# This value corresponds to the configured cluster-cidr.
# Deprecated in favor of ipv4NativeRoutingCIDR, will be removed in 1.12.
nativeRoutingCIDR: 10.0.0.0/8
# -- Specify the IPv4 CIDR for native routing (ie to avoid IP masquerade for).
# This value corresponds to the configured cluster-cidr.
ipv4NativeRoutingCIDR: 10.0.0.0/8
# tunnel is the encapsulation configuration for communication between nodes
tunnel: disabled
# loadBalancer is the general configuration for service load balancing
loadBalancer:
  # algorithm is the name of the load balancing algorithm for backend
  # selection e.g. random or maglev
  algorithm: maglev
  # mode is the operation mode of load balancing for remote backends
  # e.g. snat, dsr, hybrid
  # https://docs.cilium.io/en/v1.9/gettingstarted/kubeproxy-free/#hybrid-dsr-and-snat-mode
  # Fixes UDP Client Source IP Preservation for Local traffic
  mode: hybrid
# disableEnvoyVersionCheck removes the check for Envoy, which can be useful on
# AArch64 as the images do not currently ship a version of Envoy.
disableEnvoyVersionCheck: false
cluster:
  # -- Name of the cluster. Only required for Cluster Mesh.
  name: default
  # -- (int) Unique ID of the cluster. Must be unique across all connected
  # clusters and in the range of 1 to 255. Only required for Cluster Mesh.
  id:
clustermesh:
  # -- Deploy clustermesh-apiserver for clustermesh
  useAPIServer: true
  apiserver:
    # -- Clustermesh API server image.
    image:
      repository: quay.io/cilium/clustermesh-apiserver
      tag: v1.11.2
    etcd:
      # -- Clustermesh API server etcd image.
      image:
        repository: quay.io/coreos/etcd
        tag: v3.5.2
        pullPolicy: IfNotPresent
# -- Roll out cilium agent pods automatically when configmap is updated.
rollOutCiliumPods: true
externalIPs:
  # -- Enable ExternalIPs service support.
  enabled: true
hostPort:
  # -- Enable hostPort service support.
  enabled: false
# -- Configure ClusterIP service handling in the host namespace (the node).
hostServices:
  # -- Enable host reachable services.
  enabled: true
  # -- Supported list of protocols to apply ClusterIP translation to.
  protocols: tcp,udp
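One way to double-check that the options I asked about actually land in the rendered agent configuration is to inspect the cilium-config ConfigMap the chart generates (a quick sketch):
# Confirm the agent ConfigMap picked up the routing/tunnel/Envoy options in question.
kubectl -n kube-system get configmap cilium-config -o yaml | grep -E 'tunnel|auto-direct-node-routes|enable-envoy-config'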