K3s pods stuck in CrashLoopBackOff when restarting k3s
**Environmental Info:**
K3s Version: v1.27.7+k3s2 (575bce76)

Node(s) CPU architecture, OS, and Version: Linux devops-S2H 6.2.0-36-generic #37~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 9 15:34:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: Single-node cluster
**Describe the bug:**
After restarting k3s, CoreDNS (and a few other kube-system pods) go into CrashLoopBackOff. From the events it looks like CoreDNS cannot pass its readiness probe, and I am not sure where to start debugging. Reinstalling the cluster every time would also be tedious. The k3s service itself reports as running:
```
Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2023-11-14 09:32:32 PKT; 1min 22s ago
Docs: https://k3s.io/
Process: 1506 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
Process: 1512 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
Process: 1531 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Main PID: 1532 (k3s-server)
Tasks: 123
Memory: 672.4M
CPU: 20.201s
CGroup: /system.slice/k3s.service
├─1532 "/usr/local/bin/k3s server"
├─1590 "containerd " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" >
├─2587 /var/lib/rancher/k3s/data/bf3548384eaabb3435bf08112f1b0cba1afc5add6a6f2f2372aa2906a598fd04/bin/containerd-shim-runc-v2 -names>
├─2589 /var/lib/rancher/k3s/data/bf3548384eaabb3435bf08112f1b0cba1afc5add6a6f2f2372aa2906a598fd04/bin/containerd-shim-runc-v2 -names>
├─4252 /var/lib/rancher/k3s/data/bf3548384eaabb3435bf08112f1b0cba1afc5add6a6f2f2372aa2906a598fd04/bin/containerd-shim-runc-v2 -names>
├─5003 /var/lib/rancher/k3s/data/bf3548384eaabb3435bf08112f1b0cba1afc5add6a6f2f2372aa2906a598fd04/bin/containerd-shim-runc-v2 -names>
└─7145 /var/lib/rancher/k3s/data/bf3548384eaabb3435bf08112f1b0cba1afc5add6a6f2f2372aa2906a598fd04/bin/containerd-shim-runc-v2 -names>
Nov 14 09:33:38 devops-S2H k3s[1532]: E1114 09:33:38.163459 1532 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartCont>
Nov 14 09:33:38 devops-S2H k3s[1532]: I1114 09:33:38.858242 1532 scope.go:115] "RemoveContainer" containerID="57de58b245f8019d804999d3dd1dab05>
Nov 14 09:33:38 devops-S2H k3s[1532]: E1114 09:33:38.858629 1532 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartCont>
Nov 14 09:33:39 devops-S2H k3s[1532]: I1114 09:33:39.859960 1532 scope.go:115] "RemoveContainer" containerID="57de58b245f8019d804999d3dd1dab05>
Nov 14 09:33:39 devops-S2H k3s[1532]: E1114 09:33:39.860243 1532 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartCont>
Nov 14 09:33:43 devops-S2H k3s[1532]: I1114 09:33:43.376508 1532 scope.go:115] "RemoveContainer" containerID="57de58b245f8019d804999d3dd1dab05>
Nov 14 09:33:43 devops-S2H k3s[1532]: E1114 09:33:43.376879 1532 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartCont>
Nov 14 09:33:54 devops-S2H k3s[1532]: E1114 09:33:54.702863 1532 client.go:88] "ListAndWatch ended unexpectedly for device plugin" err="rpc er>
Nov 14 09:33:54 devops-S2H k3s[1532]: I1114 09:33:54.884418 1532 pod_container_deletor.go:80] "Container not found in pod's containers" contai>
Nov 14 09:33:54 devops-S2H k3s[1532]: I1114 09:33:54.884437 1532 scope.go:115] "RemoveContainer" containerID="0c53169c12a17fa6620bbbd71af65e48>
```
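The unit is active, so the failures are at the container level rather than in the k3s service itself. One quick way to see which containers are exiting is the crictl bundled with k3s; this is only a diagnostic sketch, run as root on the server node, and `<container-id>` is a placeholder taken from the listing:

```
# List all containers, including exited ones, via the containerd instance k3s manages
sudo k3s crictl ps -a

# Show the last lines of a crashing container's log (container ID from the listing above)
sudo k3s crictl logs --tail=50 <container-id>
```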
```
devops@devops-S2H:~$ k get po -A
NAMESPACE     NAME                                     READY   STATUS             RESTARTS       AGE
kube-system   helm-install-traefik-crd-xf222           0/1     Completed          0              21h
kube-system   helm-install-traefik-7gb4t               0/1     Completed          1              21h
kube-system   svclb-traefik-c8627e25-zk6xm             2/2     Running            2 (113s ago)   21h
kube-system   traefik-768bdcdcdd-jddb8                 1/1     Running            2 (106s ago)   21h
kube-system   coredns-77ccd57875-qkpj4                 0/1     CrashLoopBackOff   3 (47s ago)    21h
kube-system   nvidia-device-plugin-daemonset-c8snh     0/1     CrashLoopBackOff   2 (15s ago)    19h
kube-system   metrics-server-648b5df564-gtx64          0/1     Completed          2              21h
kube-system   local-path-provisioner-957fdf8bc-xq5h5   0/1     CrashLoopBackOff   1 (3s ago)     21h
```
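To see why a pod is crash-looping, the usual starting points are the pod's events and the logs of the previous (crashed) container instance. A sketch, using the CoreDNS pod name from the listing above:

```
# Events and probe failures for the crashing pod
kubectl -n kube-system describe pod coredns-77ccd57875-qkpj4

# Logs from the last container instance that exited
kubectl -n kube-system logs coredns-77ccd57875-qkpj4 --previous
```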
```
devops@devops-S2H:~$ journalctl -eu k3s | tail
Nov 14 10:13:51 devops-S2H k3s[18609]: I1114 10:13:51.913011 18609 scope.go:115] "RemoveContainer" containerID="96f2636dccf5199cf5581722f092bc1647ddddc169499276d9f372f7cc44ba3f"
Nov 14 10:13:51 devops-S2H k3s[18609]: E1114 10:13:51.913147 18609 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"nvidia-device-plugin-ctr\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=nvidia-device-plugin-ctr pod=nvidia-device-plugin-daemonset-c8snh_kube-system(3ca38f2b-5628-4b83-a763-b6405849e4f9)\"" pod="kube-system/nvidia-device-plugin-daemonset-c8snh" podUID=3ca38f2b-5628-4b83-a763-b6405849e4f9
Nov 14 10:13:53 devops-S2H k3s[18609]: I1114 10:13:53.872948 18609 scope.go:115] "RemoveContainer" containerID="bd1050656f10213235ebf482f6d97582c682784f2600db44f6c6dc7924c60a80"
Nov 14 10:13:53 devops-S2H k3s[18609]: E1114 10:13:53.873402 18609 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"traefik\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=traefik pod=traefik-768bdcdcdd-jddb8_kube-system(f25e4db5-b11d-4ddb-bef4-c801ba915e8d)\"" pod="kube-system/traefik-768bdcdcdd-jddb8" podUID=f25e4db5-b11d-4ddb-bef4-c801ba915e8d
Nov 14 10:13:56 devops-S2H k3s[18609]: I1114 10:13:56.874072 18609 scope.go:115] "RemoveContainer" containerID="d488793eaaefd0ae4fcd6186a98a2e0992d700d3ebf2004fd617b4b91953c559"
Nov 14 10:13:56 devops-S2H k3s[18609]: E1114 10:13:56.874377 18609 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"local-path-provisioner\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=local-path-provisioner pod=local-path-provisioner-957fdf8bc-xq5h5_kube-system(5f692e55-29f3-477f-bb91-601f37df419c)\"" pod="kube-system/local-path-provisioner-957fdf8bc-xq5h5" podUID=5f692e55-29f3-477f-bb91-601f37df419c
Nov 14 10:13:57 devops-S2H k3s[18609]: I1114 10:13:57.916646 18609 scope.go:115] "RemoveContainer" containerID="37266e0ace9c9c45d95531c2df50e0c0a7dd6fcd71e2939cff0f80c40647842f"
Nov 14 10:13:57 devops-S2H k3s[18609]: E1114 10:13:57.916994 18609 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"coredns\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=coredns pod=coredns-77ccd57875-qkpj4_kube-system(ef53650a-5c42-43e2-b06a-8bff928baae9)\"" pod="kube-system/coredns-77ccd57875-qkpj4" podUID=ef53650a-5c42-43e2-b06a-8bff928baae9
Nov 14 10:13:57 devops-S2H k3s[18609]: E1114 10:13:57.996381 18609 resource_quota_controller.go:441] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: stale GroupVersion discovery: metrics.k8s.io/v1beta1
Nov 14 10:13:58 devops-S2H k3s[18609]: W1114 10:13:58.560515 18609 garbagecollector.go:816] failed to discover some groups: map[metrics.k8s.io/v1beta1:stale GroupVersion discovery: metrics.k8s.io/v1beta1]
```
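The `stale GroupVersion discovery: metrics.k8s.io/v1beta1` messages are a side effect of metrics-server not being ready rather than a separate failure. The state of its APIService can be checked directly (a sketch):

```
# Reports Available=False with a reason while metrics-server is down
kubectl get apiservice v1beta1.metrics.k8s.io -o wide
```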
```
devops@devops-S2H:~$ kubectl logs coredns-68db8c5f9f-pjfg4 -n kube-system --follow
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.override
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
.:53
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.override
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
[INFO] plugin/reload: Running configuration SHA512 = b941b080e5322f6519009bb49349462c7ddb6317425b0f6a83e5451175b720703949e3f3b454a24e77f3ffe57fd5e9c6130e528a5a1dd00d9000e4afd6c1108d
CoreDNS-1.10.1
linux/amd64, go1.20, 055b2c3
[ERROR] plugin/errors: 2 8158796437723755526.3000482851770165230. HINFO: read udp 10.42.0.170:40649->172.16.10.4:53: i/o timeout
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.override
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.override
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.override
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.override
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.override
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.override
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
```
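The only hard error in the CoreDNS log is the upstream query timing out (`read udp 10.42.0.170:40649->172.16.10.4:53: i/o timeout`). By default CoreDNS forwards to the nameservers in the node's /etc/resolv.conf, so it is worth checking whether that upstream and the cluster DNS path work at all. A rough check, assuming `dig` is installed on the node; `dns-test` is just a throwaway pod name:

```
# Nameservers CoreDNS will forward to
cat /etc/resolv.conf

# Can the node itself reach the upstream resolver seen in the log?
dig @172.16.10.4 kubernetes.io +time=2 +tries=1

# Can a pod resolve through the cluster DNS service?
kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never -- nslookup kubernetes.default.svc.cluster.local
```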
Events from the CoreDNS pod:
```
Events:
Type     Reason          Age                   From               Message
----     ------          ----                  ----               -------
Normal   Scheduled       13m                   default-scheduler  Successfully assigned kube-system/coredns-68db8c5f9f-hbkvt to devops-s2h
Warning  Unhealthy       13m (x2 over 13m)     kubelet            Readiness probe failed: Get "http://10.42.0.221:8181/ready": dial tcp 10.42.0.221:8181: connect: connection refused
Warning  Unhealthy       13m                   kubelet            Readiness probe failed: Get "http://10.42.0.221:8181/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy       13m                   kubelet            Readiness probe failed: Get "http://10.42.0.222:8181/ready": dial tcp 10.42.0.222:8181: connect: connection refused
Normal   SandboxChanged  13m (x2 over 13m)     kubelet            Pod sandbox changed, it will be killed and re-created.
Warning  Unhealthy       13m                   kubelet            Readiness probe failed: Get "http://10.42.0.222:8181/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Normal   Pulled          12m (x3 over 13m)     kubelet            Container image "rancher/mirrored-coredns-coredns:1.10.1" already present on machine
Normal   Created         12m (x3 over 13m)     kubelet            Created container coredns
Normal   Started         12m (x3 over 13m)     kubelet            Started container coredns
Warning  Unhealthy       12m                   kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503
Warning  Unhealthy       12m                   kubelet            Readiness probe failed: Get "http://10.42.0.223:8181/ready": dial tcp 10.42.0.223:8181: connect: connection refused
Normal   Killing         8m20s (x4 over 13m)   kubelet            Stopping container coredns
Warning  BackOff         3m18s (x20 over 13m)  kubelet            Back-off restarting failed container coredns in pod coredns-68db8c5f9f-hbkvt_kube-system(9c77b91b-a1c8-4b41-9b96-e3b1c7508692)
```
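The readiness probe targets CoreDNS's `ready` plugin on port 8181: "connection refused" means the container is not listening yet (or is already being restarted), while a 503 means CoreDNS is running but not reporting ready. To see what the default k3s Corefile configures (readiness endpoint, health check, and the forward target), the ConfigMap can be inspected (a sketch):

```
# Default k3s CoreDNS configuration, including where queries are forwarded
kubectl -n kube-system get configmap coredns -o yaml
```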
**Steps To Reproduce:**
Installed k3s with the install script. The crash-looping pods appear after restarting the k3s service. If I run the killall and uninstall scripts and then reinstall k3s, everything works again, but that does not seem like the right way to recover (roughly the cycle sketched below).
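For reference, a sketch of the install/uninstall cycle described above, assuming the default script locations from the standard k3s install:

```
# Install k3s (single-node server) with the official script
curl -sfL https://get.k3s.io | sh -

# Stop k3s and all of its containers
/usr/local/bin/k3s-killall.sh

# Remove k3s entirely (deletes cluster data), then reinstall
/usr/local/bin/k3s-uninstall.sh
curl -sfL https://get.k3s.io | sh -
```

A less destructive step would normally be just restarting the service (`sudo systemctl restart k3s`), but per the report above that is exactly what triggers the crash loops.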
@adeel14553 Can you add more information about how the cluster was created? Also, is this problem reproducible on every new node, or was it a one-time thing?