
Calico-kube-controllers CrashLoopBackOff

xiaomizhg opened this issue 3 years ago · 8 comments

Calico-kube-controllers Crash

Environment

  • Calico v3.24.5
  • Kubernetes 1.25.2
  • CentOS 7.7

Problem Description

This is my first deployment of Calico, using https://docs.projectcalico.org/manifests/calico.yaml. (The key YAML config was posted as a screenshot.)
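For reference, that manifest is applied with a single kubectl command:

kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml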

When the deployment completed, the calico-kube-controllers pod showed CrashLoopBackOff status.

(Screenshot: pod status)

kubectl describe po calico-kube-controllers-767f55c9d8-ddj6n -n kube-system

(Screenshot: pod describe output showing the failure)

Then I checked the calico-kube-controllers-767f55c9d8-ddj6n pod logs:

2022-12-21 15:14:41.721 [INFO][1] watchersyncer.go 130: Sending status update Status=in-sync
2022-12-21 15:14:41.721 [INFO][1] syncer.go 86: Node controller syncer status updated: in-sync
2022-12-21 15:14:41.722 [ERROR][1] status.go 138: Failed to write readiness file: open /status/status.json: permission denied
2022-12-21 15:14:41.722 [WARNING][1] status.go 66: Failed to write status error=open /status/status.json: permission denied
2022-12-21 15:14:41.730 [INFO][1] hostendpoints.go 173: successfully synced all hostendpoints
I1221 15:14:41.818558       1 shared_informer.go:262] Caches are synced for nodes
I1221 15:14:41.818607       1 shared_informer.go:255] Waiting for caches to sync for pods
I1221 15:14:41.818630       1 shared_informer.go:262] Caches are synced for pods
2022-12-21 15:14:41.818 [INFO][1] ipam.go 253: Will run periodic IPAM sync every 7m30s
2022-12-21 15:14:41.819 [INFO][1] ipam.go 331: Syncer is InSync, kicking sync channel status=in-sync

I checked the node status with calicoctl. (Screenshot: node-status)

The node status looks normal, but the calico-kube-controllers pod's readiness probe keeps failing. I don't know what the problem is. Can someone help me?

xiaomizhg · Dec 17 '22

Looks like a bug was somehow introduced into our status reporting logic. As a workaround, you can remove the readiness probe from calico-kube-controllers, but obviously that's not a long-term solution. We'll need to figure out why that location isn't writable.
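A minimal sketch of that workaround, assuming the manifest-based install where the deployment lives in kube-system and the controller is the first container in the pod spec:

kubectl patch deployment calico-kube-controllers -n kube-system --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}]'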

The offending log is:

Failed to write readiness file: open /status/status.json: permission denied

For the future, it's better to use plain text for issues rather than images; it makes them more easily searchable.

caseydavenport · Dec 19 '22

https://github.com/projectcalico/calico/blob/4c638885a88b2b598382b9d757b99ef90a135db2/kube-controllers/pkg/status/status.go#L136

The file is written with permission 0644, which seems right.
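The mode bits alone don't tell the whole story, though: 0644 grants write access only to the file's owner, so if the container runs as a non-root user while /status (or an existing status.json) is owned by root, the open still fails. A minimal shell reproduction, using an illustrative path and UID:

sudo touch /tmp/status.json && sudo chmod 0644 /tmp/status.json
sudo -u '#65534' sh -c 'echo "{}" > /tmp/status.json'
# sh: cannot create /tmp/status.json: Permission denied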

cyclinder · Dec 20 '22

You can delete the calico-kube-controllers pod; after it is recreated, it returns to normal.
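For example, assuming the default label and namespace from the manifest install:

kubectl delete pod -n kube-system -l k8s-app=calico-kube-controllers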

pullusers · Jan 10 '23

@xiaomizhg did you make any progress on this one?

caseydavenport · Jan 19 '23

You can delete the calico-kube-controllers pod; after it is recreated, it returns to normal.

@pullusers No, that doesn't work.

Same issue on k8s 1.23.6 with rancher/mirrored-calico-kube-controllers:v3.22.0:

2023-01-20 07:07:09.901 [INFO][1] main.go 97: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
W0120 07:07:09.903717       1 client_config.go:615] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2023-01-20 07:07:09.904 [INFO][1] main.go 118: Ensuring Calico datastore is initialized
2023-01-20 07:07:09.926 [INFO][1] main.go 159: Getting initial config snapshot from datastore
2023-01-20 07:07:09.943 [INFO][1] main.go 162: Got initial config snapshot
2023-01-20 07:07:09.944 [INFO][1] watchersyncer.go 89: Start called
2023-01-20 07:07:09.944 [INFO][1] main.go 179: Starting status report routine
2023-01-20 07:07:09.944 [INFO][1] main.go 188: Starting Prometheus metrics server on port 9094
2023-01-20 07:07:09.944 [INFO][1] watchersyncer.go 127: Sending status update Status=wait-for-ready
2023-01-20 07:07:09.944 [INFO][1] main.go 463: Starting informer informer=&cache.sharedIndexInformer{indexer:(*cache.cache)(0xc0009e3cb0), controller:cache.Controller(nil), processor:(*cache.sharedProcessor)(0xc000172460), cacheMutationDetector:cache.dummyMutationDetector{}, listerWatcher:(*cache.ListWatch)(0xc0009e3c98), objectType:(*v1.Pod)(0xc000101000), resyncCheckPeriod:0, defaultEventHandlerResyncPeriod:0, clock:(*clock.RealClock)(0x2e2b5b0), started:false, stopped:false, startedLock:sync.Mutex{state:0, sema:0x0}, blockDeltas:sync.Mutex{state:0, sema:0x0}, watchErrorHandler:(cache.WatchErrorHandler)(nil)}
2023-01-20 07:07:09.944 [INFO][1] main.go 463: Starting informer informer=&cache.sharedIndexInformer{indexer:(*cache.cache)(0xc0009e3cf8), controller:cache.Controller(nil), processor:(*cache.sharedProcessor)(0xc0001724d0), cacheMutationDetector:cache.dummyMutationDetector{}, listerWatcher:(*cache.ListWatch)(0xc0009e3ce0), objectType:(*v1.Node)(0xc00031cc00), resyncCheckPeriod:0, defaultEventHandlerResyncPeriod:0, clock:(*clock.RealClock)(0x2e2b5b0), started:false, stopped:false, startedLock:sync.Mutex{state:0, sema:0x0}, blockDeltas:sync.Mutex{state:0, sema:0x0}, watchErrorHandler:(cache.WatchErrorHandler)(nil)}
2023-01-20 07:07:09.944 [INFO][1] syncer.go 78: Node controller syncer status updated: wait-for-ready
2023-01-20 07:07:09.944 [INFO][1] main.go 469: Starting controller ControllerType="Node"
2023-01-20 07:07:09.944 [INFO][1] controller.go 189: Starting Node controller
2023-01-20 07:07:09.944 [INFO][1] watchersyncer.go 147: Starting main event processing loop
2023-01-20 07:07:09.944 [INFO][1] watchercache.go 175: Full resync is required ListRoot="/calico/ipam/v2/assignment/"
2023-01-20 07:07:09.944 [INFO][1] watchercache.go 175: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2023-01-20 07:07:09.944 [ERROR][1] status.go 138: Failed to write readiness file: open /status/status.json: permission denied
2023-01-20 07:07:09.944 [WARNING][1] status.go 66: Failed to write status error=open /status/status.json: permission denied
2023-01-20 07:07:09.944 [ERROR][1] status.go 138: Failed to write readiness file: open /status/status.json: permission denied
2023-01-20 07:07:09.944 [WARNING][1] status.go 66: Failed to write status error=open /status/status.json: permission denied
2023-01-20 07:07:09.944 [INFO][1] resources.go 350: Main client watcher loop
2023-01-20 07:07:09.946 [INFO][1] watchercache.go 273: Sending synced update ListRoot="/calico/ipam/v2/assignment/"
2023-01-20 07:07:09.947 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2023-01-20 07:07:09.947 [INFO][1] watchersyncer.go 127: Sending status update Status=resync
2023-01-20 07:07:09.947 [INFO][1] syncer.go 78: Node controller syncer status updated: resync
2023-01-20 07:07:09.947 [ERROR][1] status.go 138: Failed to write readiness file: open /status/status.json: permission denied
2023-01-20 07:07:09.947 [WARNING][1] status.go 66: Failed to write status error=open /status/status.json: permission denied
2023-01-20 07:07:09.949 [ERROR][1] status.go 138: Failed to write readiness file: open /status/status.json: permission denied
2023-01-20 07:07:09.949 [WARNING][1] status.go 66: Failed to write status error=open /status/status.json: permission denied
2023-01-20 07:07:09.949 [INFO][1] watchercache.go 273: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2023-01-20 07:07:09.949 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2023-01-20 07:07:09.949 [INFO][1] watchersyncer.go 221: All watchers have sync'd data - sending data and final sync
2023-01-20 07:07:09.949 [INFO][1] watchersyncer.go 127: Sending status update Status=in-sync
2023-01-20 07:07:09.949 [INFO][1] syncer.go 78: Node controller syncer status updated: in-sync
2023-01-20 07:07:09.958 [INFO][1] hostendpoints.go 177: successfully synced all hostendpoints
2023-01-20 07:07:10.044 [INFO][1] ipam.go 202: Will run periodic IPAM sync every 7m30s
2023-01-20 07:07:10.044 [INFO][1] ipam.go 280: Syncer is InSync, kicking sync channel status=in-sync
kubectl get pods -n kube-system
NAME                                      READY   STATUS    RESTARTS          AGE
calico-kube-controllers-fc7fcb565-llgwd   0/1     Running   424 (5m55s ago)   25h
canal-ccjxw                               2/2     Running   0                 25h
coredns-58d67995c7-2kt8t                  1/1     Running   0                 25h
coredns-autoscaler-7d5478875b-6r2b2       1/1     Running   0                 25h
kube-eventer-66fcb6c6c6-hsv6n             1/1     Running   0                 25h
kube-flannel-5pcd2                        2/2     Running   0                 25h
metrics-server-5c4895ffbd-kggmc           1/1     Running   0                 25h

631068264 · Jan 20 '23

You can delete the calico-kube-controllers pod; after it is recreated, it returns to normal.

This works for me.

tiansiyuan · Jun 06 '23

I had the same problem, and deleting the pod so that k8s recreated it did not fix it. It turns out that calico-kube-controllers must run with some sort of privilege to write /status/status.json.

My installation (Rancher 2.7.3, k8s 1.25.7 on Rocky 9, with calico-kube-controllers v3.25.0) uses Gatekeeper 'assign' resources as part of a mutating webhook that sets things like runAsUser: 65534, runAsGroup: 65534, privileged: false, allowPrivilegeEscalation: false, and others in a securityContext if they do not already exist in a podSpec. If I exempt the calico-system namespace from that mutating webhook, calico-kube-controllers can write the file just fine. I have not had to exempt the k8s namespaces such as kube-system from the webhook, which suggests the components running there explicitly set these fields (but I have not looked in detail).

It would be nice, especially for system-level components such as Calico, to explicitly set the privileges they need via a securityContext. https://github.com/projectcalico/calico/blob/master/charts/calico/templates/calico-kube-controllers.yaml does not set any securityContext; if it did, I would have caught what I needed to change a lot sooner. With the advent of Pod Security Standards, along with Kyverno and Gatekeeper, it would be good for system-level components such as Calico to explicitly declare the privileges they need, rather than relying on defaults, to make integration easier.
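A hypothetical excerpt of what that could look like in the deployment template; none of these values come from the official chart, and the UID/GID are purely illustrative:

# Explicitly declaring the privileges the controller needs, so that
# admission webhooks (Gatekeeper/Kyverno) that only fill in *missing*
# securityContext fields cannot inject incompatible defaults.
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 999              # illustrative UID; must be able to write /status
        fsGroup: 999                # make mounted volumes group-writable for this GID
      containers:
        - name: calico-kube-controllers
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: false   # /status must remain writable

With fields like these set explicitly, a mutating webhook that only fills in absent securityContext fields would leave the pod's write access to /status intact.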

caduceus4 · Jun 08 '23

I got the same issue; any update on it?

tongvt · Aug 08 '23

I got the same issue:

  • Calico v3.25.1
  • Kubernetes 1.26.5
  • Ubuntu 22.04

It was working fine, then it suddenly failed. This seems to be a bug.

"calico-kube-controller crashloopbackoff -> status.go 137 failed to write status error=open /status/status.json: permission denied (readiness liveness)"

Pulling the v3.25.2 image in place of v3.25.1 brought the pod to 1/1 Running.

Does anyone know what was changed between those versions?

3siksfather · Dec 18 '24

We updated how calico-kube-controllers permissions are handled in Calico v3.26.0, so please try updating to a newer version to see if that resolves the issue. If not, please open a new issue to track it, as the versions of Calico used here are quite old and this thread mixes a few different descriptions of the problem.
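For a manifest-based install, the upgrade amounts to re-applying the newer manifest; a sketch (the URL follows the layout newer Calico releases use, so verify it against the docs for your install method):

kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.0/manifests/calico.yaml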

caseydavenport · Dec 30 '24

I am also getting a similar error, with the same CrashLoopBackOff status, in MicroK8s.

calico-kube-controllers-5947598c79-z8zcd 0/1 CrashLoopBackOff

  • Calico: docker.io/calico/node:v3.28.1
  • MicroK8s: v1.32.3 (revision 8148)
  • Ubuntu: 24.04.2 LTS

Problem Description

More than one controller pod is in CrashLoopBackOff.

Pods:

microk8s kubectl get pods -A 
kube-system      calico-kube-controllers-5947598c79-f99vb   0/1     CrashLoopBackOff             32 (2m36s ago)   102m
kube-system      calico-node-2cp8j                          1/1     Running                      1                92m
metallb-system   controller-7ffc454778-65llg                0/1     CrashLoopBackOff             23 (35s ago)     47m
metallb-system   speaker-llgsh                              0/1     CreateContainerConfigError   0                47m

Description check

microk8s kubectl describe pods -A | grep -A 20 -B 5 "Error\|Failed\|CrashLoopBackOff"
   Image:          docker.io/calico/kube-controllers:v3.28.1
    Image ID:       docker.io/calico/kube-controllers@sha256:eadb3a25109a5371b7b2dc30d74b2d9b2083eba58abdc034c81f974a48871330
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 25 Aug 2025 11:52:43 +0300
      Finished:     Mon, 25 Aug 2025 11:53:43 +0300
    Ready:          False
    Restart Count:  40
    Liveness:       exec [/usr/bin/check-status -l] delay=10s timeout=10s period=10s #success=1 #failure=6
    Readiness:      exec [/usr/bin/check-status -r] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      ENABLED_CONTROLLERS:  node
      DATASTORE_TYPE:       kubernetes
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xqlpn (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
--
    Host Ports:    0/TCP, 0/TCP
    Args:
      --port=7472
      --log-level=info
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 25 Aug 2025 11:52:44 +0300
      Finished:     Mon, 25 Aug 2025 11:53:14 +0300
    Ready:          False
    Restart Count:  29
    Liveness:       http-get http://:monitoring/metrics delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:monitoring/metrics delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      METALLB_ML_SECRET_NAME:  memberlist
      METALLB_DEPLOYMENT:      controller
    Mounts:
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dxbg7 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
--
    Host Ports:    7472/TCP, 7946/TCP, 7946/UDP
    Args:
      --port=7472
      --log-level=info
    State:          Waiting
      Reason:       CreateContainerConfigError
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://:monitoring/metrics delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:monitoring/metrics delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      METALLB_NODE_NAME:       (v1:spec.nodeName)
      METALLB_HOST:            (v1:status.hostIP)
      METALLB_ML_BIND_ADDR:    (v1:status.podIP)
      METALLB_ML_LABELS:      app=metallb,component=speaker
      METALLB_ML_SECRET_KEY:  <set to the key 'secretkey' in secret 'memberlist'>  Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8mc8z (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
--
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason  Age                   From     Message
  ----     ------  ----                  ----     -------
  Normal   Pulled  111s (x231 over 51m)  kubelet  Container image "quay.io/metallb/speaker:v0.13.3" already present on machine
  Warning  Failed  111s (x231 over 51m)  kubelet  Error: secret "memberlist" not found
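The speaker failure is explained by the event itself: the memberlist secret is missing. Per the MetalLB docs it can be created by hand; a sketch, where the key contents are just random data:

kubectl create secret generic -n metallb-system memberlist \
  --from-literal=secretkey="$(openssl rand -base64 128)"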

Busybox check

microk8s kubectl run -it --rm busybox --image=busybox:1.36 --restart=Never -- sh
If you don't see a command prompt, try pressing enter.
/ # 
/ # 
/ # wget -qO- https://10.152.183.1:443 --no-check-certificate
wget: can't connect to remote host (10.152.183.1): Connection timed out

Logs calico controllers

microk8s kubectl logs calico-kube-controllers-5947598c79-f99vb -n kube-system

2025-08-25 09:18:13.176 [INFO][1] main.go 99: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
2025-08-25 09:18:13.178 [WARNING][1] winutils.go 150: Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2025-08-25 09:18:13.179 [INFO][1] main.go 123: Ensuring Calico datastore is initialized
2025-08-25 09:18:43.182 [ERROR][1] client.go 287: Error getting cluster information config ClusterInformation="default" error=Get "https://10.152.183.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.152.183.1:443: i/o timeout
2025-08-25 09:18:43.182 [INFO][1] main.go 130: Failed to initialize datastore error=Get "https://10.152.183.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.152.183.1:443: i/o timeout
2025-08-25 09:19:13.191 [ERROR][1] client.go 287: Error getting cluster information config ClusterInformation="default" error=Get "https://10.152.183.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.152.183.1:443: i/o timeout
2025-08-25 09:19:13.191 [INFO][1] main.go 130: Failed to initialize datastore error=Get "https://10.152.183.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.152.183.1:443: i/o timeout
2025-08-25 09:19:13.191 [FATAL][1] main.go 143: Failed to initialize Calico datastore
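Note: this failure is different from the /status permission error above. The i/o timeouts reaching 10.152.183.1 (the in-cluster kubernetes Service) point at pod-to-API-server connectivity. A couple of first checks, assuming MicroK8s:

microk8s kubectl get endpoints kubernetes -o wide
microk8s inspect   # collects logs and runs MicroK8s' built-in health checks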

Logs metallb controller

microk8s kubectl logs controller-7ffc454778-65llg -n metallb-system

{"branch":"dev","caller":"main.go:141","commit":"dev","goversion":"gc / go1.18.3 / amd64","level":"info","msg":"MetalLB controller starting version 0.13.3 (commit dev, branch dev)","ts":"2025-08-25T09:26:44Z","version":"0.13.3"}

I opened an issue: https://github.com/canonical/microk8s/issues/5197

FATIHISILGAN · Aug 25 '25