
K3s pods stuck in CrashLoopBackOff after restarting k3s


Environmental Info: K3s Version: k3s version v1.27.7+k3s2 (575bce76)

Node(s) CPU architecture, OS, and Version: Linux devops-S2H 6.2.0-36-generic #37~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 9 15:34:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: single-node cluster

Describe the bug: From what I can see, coredns can't pass its readiness probe after the k3s service is restarted, and I don't know where to start debugging. Having to redo the whole setup every time would also be a hassle. Status of the k3s service right after the restart:

```
Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2023-11-14 09:32:32 PKT; 1min 22s ago
Docs: https://k3s.io/
Process: 1506 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
Process: 1512 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
Process: 1531 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Main PID: 1532 (k3s-server)
Tasks: 123
Memory: 672.4M
CPU: 20.201s
CGroup: /system.slice/k3s.service
├─1532 "/usr/local/bin/k3s server"
├─1590 "containerd " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" >
├─2587 /var/lib/rancher/k3s/data/bf3548384eaabb3435bf08112f1b0cba1afc5add6a6f2f2372aa2906a598fd04/bin/containerd-shim-runc-v2 -names>
├─2589 /var/lib/rancher/k3s/data/bf3548384eaabb3435bf08112f1b0cba1afc5add6a6f2f2372aa2906a598fd04/bin/containerd-shim-runc-v2 -names>
├─4252 /var/lib/rancher/k3s/data/bf3548384eaabb3435bf08112f1b0cba1afc5add6a6f2f2372aa2906a598fd04/bin/containerd-shim-runc-v2 -names>
├─5003 /var/lib/rancher/k3s/data/bf3548384eaabb3435bf08112f1b0cba1afc5add6a6f2f2372aa2906a598fd04/bin/containerd-shim-runc-v2 -names>
└─7145 /var/lib/rancher/k3s/data/bf3548384eaabb3435bf08112f1b0cba1afc5add6a6f2f2372aa2906a598fd04/bin/containerd-shim-runc-v2 -names>
Nov 14 09:33:38 devops-S2H k3s[1532]: E1114 09:33:38.163459 1532 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartCont>
Nov 14 09:33:38 devops-S2H k3s[1532]: I1114 09:33:38.858242 1532 scope.go:115] "RemoveContainer" containerID="57de58b245f8019d804999d3dd1dab05>
Nov 14 09:33:38 devops-S2H k3s[1532]: E1114 09:33:38.858629 1532 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartCont>
Nov 14 09:33:39 devops-S2H k3s[1532]: I1114 09:33:39.859960 1532 scope.go:115] "RemoveContainer" containerID="57de58b245f8019d804999d3dd1dab05>
Nov 14 09:33:39 devops-S2H k3s[1532]: E1114 09:33:39.860243 1532 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartCont>
Nov 14 09:33:43 devops-S2H k3s[1532]: I1114 09:33:43.376508 1532 scope.go:115] "RemoveContainer" containerID="57de58b245f8019d804999d3dd1dab05>
Nov 14 09:33:43 devops-S2H k3s[1532]: E1114 09:33:43.376879 1532 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartCont>
Nov 14 09:33:54 devops-S2H k3s[1532]: E1114 09:33:54.702863 1532 client.go:88] "ListAndWatch ended unexpectedly for device plugin" err="rpc er>
Nov 14 09:33:54 devops-S2H k3s[1532]: I1114 09:33:54.884418 1532 pod_container_deletor.go:80] "Container not found in pod's containers" contai>
Nov 14 09:33:54 devops-S2H k3s[1532]: I1114 09:33:54.884437 1532 scope.go:115] "RemoveContainer" containerID="0c53169c12a17fa6620bbbd71af65e48>
```

```
devops@devops-S2H:~$ k get po -A
NAMESPACE     NAME                                     READY   STATUS             RESTARTS       AGE
kube-system   helm-install-traefik-crd-xf222           0/1     Completed          0              21h
kube-system   helm-install-traefik-7gb4t               0/1     Completed          1              21h
kube-system   svclb-traefik-c8627e25-zk6xm             2/2     Running            2 (113s ago)   21h
kube-system   traefik-768bdcdcdd-jddb8                 1/1     Running            2 (106s ago)   21h
kube-system   coredns-77ccd57875-qkpj4                 0/1     CrashLoopBackOff   3 (47s ago)    21h
kube-system   nvidia-device-plugin-daemonset-c8snh     0/1     CrashLoopBackOff   2 (15s ago)    19h
kube-system   metrics-server-648b5df564-gtx64          0/1     Completed          2              21h
kube-system   local-path-provisioner-957fdf8bc-xq5h5   0/1     CrashLoopBackOff   1 (3s ago)     21h
```
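
For what it's worth, the previous (crashed) container's logs and the pod events usually show why each one keeps exiting. A minimal way to pull them for the pods above (the pod names change on every restart, so adjust them to the current listing):

```
# Pod names taken from the `k get po -A` output above; adjust to the current ones.
for pod in coredns-77ccd57875-qkpj4 \
           nvidia-device-plugin-daemonset-c8snh \
           local-path-provisioner-957fdf8bc-xq5h5; do
  echo "===== $pod ====="
  # Logs of the container that just crashed (not the current restart attempt)
  kubectl -n kube-system logs "$pod" --previous
  # Recent events: probe failures, back-off, sandbox changes
  kubectl -n kube-system describe pod "$pod" | tail -n 25
done
```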


```
devops@devops-S2H:~$ journalctl -eu k3s | tail
Nov 14 10:13:51 devops-S2H k3s[18609]: I1114 10:13:51.913011   18609 scope.go:115] "RemoveContainer" containerID="96f2636dccf5199cf5581722f092bc1647ddddc169499276d9f372f7cc44ba3f"
Nov 14 10:13:51 devops-S2H k3s[18609]: E1114 10:13:51.913147   18609 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"nvidia-device-plugin-ctr\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=nvidia-device-plugin-ctr pod=nvidia-device-plugin-daemonset-c8snh_kube-system(3ca38f2b-5628-4b83-a763-b6405849e4f9)\"" pod="kube-system/nvidia-device-plugin-daemonset-c8snh" podUID=3ca38f2b-5628-4b83-a763-b6405849e4f9
Nov 14 10:13:53 devops-S2H k3s[18609]: I1114 10:13:53.872948   18609 scope.go:115] "RemoveContainer" containerID="bd1050656f10213235ebf482f6d97582c682784f2600db44f6c6dc7924c60a80"
Nov 14 10:13:53 devops-S2H k3s[18609]: E1114 10:13:53.873402   18609 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"traefik\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=traefik pod=traefik-768bdcdcdd-jddb8_kube-system(f25e4db5-b11d-4ddb-bef4-c801ba915e8d)\"" pod="kube-system/traefik-768bdcdcdd-jddb8" podUID=f25e4db5-b11d-4ddb-bef4-c801ba915e8d
Nov 14 10:13:56 devops-S2H k3s[18609]: I1114 10:13:56.874072   18609 scope.go:115] "RemoveContainer" containerID="d488793eaaefd0ae4fcd6186a98a2e0992d700d3ebf2004fd617b4b91953c559"
Nov 14 10:13:56 devops-S2H k3s[18609]: E1114 10:13:56.874377   18609 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"local-path-provisioner\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=local-path-provisioner pod=local-path-provisioner-957fdf8bc-xq5h5_kube-system(5f692e55-29f3-477f-bb91-601f37df419c)\"" pod="kube-system/local-path-provisioner-957fdf8bc-xq5h5" podUID=5f692e55-29f3-477f-bb91-601f37df419c
Nov 14 10:13:57 devops-S2H k3s[18609]: I1114 10:13:57.916646   18609 scope.go:115] "RemoveContainer" containerID="37266e0ace9c9c45d95531c2df50e0c0a7dd6fcd71e2939cff0f80c40647842f"
Nov 14 10:13:57 devops-S2H k3s[18609]: E1114 10:13:57.916994   18609 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"coredns\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=coredns pod=coredns-77ccd57875-qkpj4_kube-system(ef53650a-5c42-43e2-b06a-8bff928baae9)\"" pod="kube-system/coredns-77ccd57875-qkpj4" podUID=ef53650a-5c42-43e2-b06a-8bff928baae9
Nov 14 10:13:57 devops-S2H k3s[18609]: E1114 10:13:57.996381   18609 resource_quota_controller.go:441] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: stale GroupVersion discovery: metrics.k8s.io/v1beta1
Nov 14 10:13:58 devops-S2H k3s[18609]: W1114 10:13:58.560515   18609 garbagecollector.go:816] failed to discover some groups: map[metrics.k8s.io/v1beta1:stale GroupVersion discovery: metrics.k8s.io/v1beta1]
```
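
The `stale GroupVersion discovery: metrics.k8s.io/v1beta1` lines are presumably just a side effect of metrics-server being down, since it serves the aggregated metrics API. A quick way to check (assuming the metrics-server deployment bundled with k3s):

```
# Availability of the aggregated metrics API
kubectl get apiservice v1beta1.metrics.k8s.io
# Logs from the bundled metrics-server deployment
kubectl -n kube-system logs deploy/metrics-server
```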

```
devops@devops-S2H:~$ kubectl logs coredns-68db8c5f9f-pjfg4 -n kube-system --follow
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.override
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
.:53
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.override
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
[INFO] plugin/reload: Running configuration SHA512 = b941b080e5322f6519009bb49349462c7ddb6317425b0f6a83e5451175b720703949e3f3b454a24e77f3ffe57fd5e9c6130e528a5a1dd00d9000e4afd6c1108d
CoreDNS-1.10.1
linux/amd64, go1.20, 055b2c3
[ERROR] plugin/errors: 2 8158796437723755526.3000482851770165230. HINFO: read udp 10.42.0.170:40649->172.16.10.4:53: i/o timeout
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.override
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.override
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.override
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.override
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.override
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.override
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
```
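
The `i/o timeout` above is against 172.16.10.4:53, which should be the upstream resolver coredns forwards to (as far as I can tell, the default k3s Corefile forwards to the node's /etc/resolv.conf). A quick reachability check from the node, assuming `dig` is available:

```
# Can the upstream resolver be reached from the node at all?
dig @172.16.10.4 example.com +time=2 +tries=1
# What coredns is forwarding to (default Corefile uses the host's resolv.conf)
cat /etc/resolv.conf
```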

Events from `kubectl describe pod` on the coredns pod:

```
Events:
  Type     Reason          Age                   From               Message
  ----     ------          ----                  ----               -------
  Normal   Scheduled       13m                   default-scheduler  Successfully assigned kube-system/coredns-68db8c5f9f-hbkvt to devops-s2h
  Warning  Unhealthy       13m (x2 over 13m)     kubelet            Readiness probe failed: Get "http://10.42.0.221:8181/ready": dial tcp 10.42.0.221:8181: connect: connection refused
  Warning  Unhealthy       13m                   kubelet            Readiness probe failed: Get "http://10.42.0.221:8181/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy       13m                   kubelet            Readiness probe failed: Get "http://10.42.0.222:8181/ready": dial tcp 10.42.0.222:8181: connect: connection refused
  Normal   SandboxChanged  13m (x2 over 13m)     kubelet            Pod sandbox changed, it will be killed and re-created.
  Warning  Unhealthy       13m                   kubelet            Readiness probe failed: Get "http://10.42.0.222:8181/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Normal   Pulled          12m (x3 over 13m)     kubelet            Container image "rancher/mirrored-coredns-coredns:1.10.1" already present on machine
  Normal   Created         12m (x3 over 13m)     kubelet            Created container coredns
  Normal   Started         12m (x3 over 13m)     kubelet            Started container coredns
  Warning  Unhealthy       12m                   kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy       12m                   kubelet            Readiness probe failed: Get "http://10.42.0.223:8181/ready": dial tcp 10.42.0.223:8181: connect: connection refused
  Normal   Killing         8m20s (x4 over 13m)   kubelet            Stopping container coredns
  Warning  BackOff         3m18s (x20 over 13m)  kubelet            Back-off restarting failed container coredns in pod coredns-68db8c5f9f-hbkvt_kube-system(9c77b91b-a1c8-4b41-9b96-e3b1c7508692)
```
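
Since the readiness probe is just an HTTP GET on port 8181 of the pod (as the events show), it can also be polled directly from the node once the pod has an IP; note the 10.42.0.x addresses above go stale after every restart:

```
# Current coredns pod IP
kubectl -n kube-system get pod -o wide | grep coredns
# Hit the readiness endpoint directly; HTTP 200 means the probe would pass.
# Replace the IP with the one reported by the command above.
curl -v http://10.42.0.223:8181/ready
```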

**Steps To Reproduce:**
I installed k3s with the install script. If I run the killall script, uninstall k3s, and install it again, everything comes up fine, but I don't think that's the right way to deal with this (commands below for reference).
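
For reference, this is what I mean by a plain restart versus the full reset that does work (default script install; the killall/uninstall scripts are the ones shipped by the installer):

```
# Normal restart, which is where the pods get stuck:
sudo systemctl restart k3s

# Full reset, which "fixes" it but wipes and recreates the cluster:
sudo /usr/local/bin/k3s-killall.sh
sudo /usr/local/bin/k3s-uninstall.sh
curl -sfL https://get.k3s.io | sh -
```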

adeel14553 avatar Nov 14 '23 05:11 adeel14553

@adeel14553 Can you add more information about how you created the cluster? Also, is this problem reproducible on every new node, or was it a one-time thing?

galal-hussein avatar Jan 10 '24 21:01 galal-hussein

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

github-actions[bot] avatar Feb 25 '24 20:02 github-actions[bot]