When worker or master nodes get shut down, they get stuck at NotReady
Summary
When stopping (restarting) a node on 1.30, we regularly run into the issue that it does not become Ready again in the cluster.
`microk8s inspect` shows `FAIL: Service snap.microk8s.daemon-kubelite is not running`.
What Should Happen Instead?
The node should come online without issues.
Reproduction Steps
The reproduction is not consistent, so the trigger seems to be related either to the startup of the node or to the shutdown before it.
- Set up 8 Debian 12 cloud-image VMs.
- Create an HA cluster with 4 master (control-plane) and 4 worker nodes.
- Shut down (restart) one of the nodes and check whether it becomes Ready again, as sketched below.
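A minimal sketch of the setup and reproduction commands, assuming the standard MicroK8s clustering workflow. The hostnames match the listing below; the `10.14.214.x` address and the join `<token>` are placeholders, not values from the report.

```bash
# On the first control-plane node: generate a join command per additional node.
# `microk8s add-node` prints something like:
#   microk8s join 10.14.214.x:25000/<token>
microk8s add-node

# On each additional control-plane node (k8s-test-m2 .. k8s-test-m4):
microk8s join 10.14.214.x:25000/<token>

# On each worker node (k8s-test-n1 .. k8s-test-n4):
microk8s join 10.14.214.x:25000/<token> --worker

# Reproduce: shut down (or reboot) one node, e.g. k8s-test-n1 ...
sudo shutdown -r now

# ... and watch from another node whether it ever becomes Ready again:
kubectl get nodes -o wide --watch
```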
```
kubectl get nodes
NAME          STATUS     ROLES    AGE   VERSION
k8s-test-m1   Ready      <none>   79d   v1.30.1
k8s-test-m2   Ready      <none>   79d   v1.30.1
k8s-test-m3   Ready      <none>   79d   v1.30.1
k8s-test-m4   Ready      <none>   79d   v1.30.1
k8s-test-n1   NotReady   <none>   39h   v1.30.1
k8s-test-n2   NotReady   <none>   39h   v1.30.1
k8s-test-n3   Ready      <none>   69d   v1.30.1
k8s-test-n4   Ready      <none>   39h   v1.30.1
```
On both affected nodes, the Kubernetes-related pods stay in Running status, while all application pods are stuck in Terminating.
```
NAMESPACE         NAME                               READY   STATUS    RESTARTS   AGE   IP             NODE          NOMINATED NODE   READINESS GATES
ceph-csi-cephfs   ceph-csi-cephfs-nodeplugin-npz5d   3/3     Running   0          39h   10.14.214.41   k8s-test-n1   <none>           <none>
ceph-csi-rbd      ceph-csi-rbd-nodeplugin-wvjdn      3/3     Running   0          39h   10.14.214.41   k8s-test-n1   <none>           <none>
kube-system       calico-node-7qbzl                  1/1     Running   0          39h   10.14.214.41   k8s-test-n1   <none>           <none>
metallb-system    speaker-82qn4                      1/1     Running   0          39h   10.14.214.41   k8s-test-n1   <none>           <none>
ceph-csi-cephfs   ceph-csi-cephfs-nodeplugin-hjlwq   3/3     Running   0          39h   10.14.214.42   k8s-test-n2   <none>           <none>
ceph-csi-rbd      ceph-csi-rbd-nodeplugin-zbnmp      3/3     Running   0          39h   10.14.214.42   k8s-test-n2   <none>           <none>
kube-system       calico-node-qrdt2                  1/1     Running   0          39h   10.14.214.42   k8s-test-n2   <none>           <none>
metallb-system    speaker-p74qt                      1/1     Running   0          39h   10.14.214.42   k8s-test-n2   <none>           <none>
```
On the affected node, the kubelite service is in a failed state:

```
× snap.microk8s.daemon-kubelite.service - Service for snap application microk8s.daemon-kubelite
     Loaded: loaded (/etc/systemd/system/snap.microk8s.daemon-kubelite.service; enabled; preset: enabled)
    Drop-In: /etc/systemd/system/snap.microk8s.daemon-kubelite.service.d
             └─delegate.conf
     Active: failed (Result: exit-code) since Fri 2024-07-19 13:10:13 UTC; 13min ago
   Duration: 426ms
    Process: 1860 ExecStart=/usr/bin/snap run microk8s.daemon-kubelite (code=exited, status=255/EXCEPTION)
   Main PID: 1860 (code=exited, status=255/EXCEPTION)
        CPU: 467ms

Jul 19 13:10:13 k8s-test-n1 systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 5.
Jul 19 13:10:13 k8s-test-n1 systemd[1]: Stopped snap.microk8s.daemon-kubelite.service - Service for snap application microk8s.daemon-kubelite.
Jul 19 13:10:13 k8s-test-n1 systemd[1]: snap.microk8s.daemon-kubelite.service: Start request repeated too quickly.
Jul 19 13:10:13 k8s-test-n1 systemd[1]: snap.microk8s.daemon-kubelite.service: Failed with result 'exit-code'.
Jul 19 13:10:13 k8s-test-n1 systemd[1]: Failed to start snap.microk8s.daemon-kubelite.service - Service for snap application microk8s.daemon-kubelite.
```
Attached: snap.microk8s.daemon-kubelite.service.txt
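Not part of the original report, but for completeness, a sketch of how the failed unit can be inspected and cleared on the affected node with standard systemd tooling (the unit name is taken from the status output above):

```bash
# Show why the kubelite process exited with status 255 before systemd
# gave up restarting it ("Start request repeated too quickly").
sudo journalctl -u snap.microk8s.daemon-kubelite.service -b --no-pager | tail -n 100

# Clear the failed state so the unit can be started again
# (the snap restart in the workaround below then brings it back up).
sudo systemctl reset-failed snap.microk8s.daemon-kubelite.service
```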
After manually restarting MicroK8s on the worker node with `sudo snap stop microk8s` followed by `sudo snap start microk8s`, it recovered and reconnected:
```
NAMESPACE         NAME                               READY   STATUS    RESTARTS      AGE   IP             NODE          NOMINATED NODE   READINESS GATES
ceph-csi-cephfs   ceph-csi-cephfs-nodeplugin-npz5d   3/3     Running   3 (14m ago)   39h   10.14.214.41   k8s-test-n1   <none>           <none>
ceph-csi-rbd      ceph-csi-rbd-nodeplugin-wvjdn      3/3     Running   3 (14m ago)   39h   10.14.214.41   k8s-test-n1   <none>           <none>
kube-system       calico-node-7qbzl                  1/1     Running   1 (14m ago)   39h   10.14.214.41   k8s-test-n1   <none>           <none>
metallb-system    speaker-82qn4                      1/1     Running   1 (14m ago)   39h   10.14.214.41   k8s-test-n1   <none>           <none>
ceph-csi-cephfs   ceph-csi-cephfs-nodeplugin-hjlwq   3/3     Running   3 (14m ago)   39h   10.14.214.42   k8s-test-n2   <none>           <none>
ceph-csi-rbd      ceph-csi-rbd-nodeplugin-zbnmp      3/3     Running   3 (14m ago)   39h   10.14.214.42   k8s-test-n2   <none>           <none>
kube-system       calico-node-qrdt2                  1/1     Running   1 (14m ago)   39h   10.14.214.42   k8s-test-n2   <none>           <none>
metallb-system    speaker-p74qt                      1/1     Running   1 (14m ago)   39h   10.14.214.42   k8s-test-n2   <none>           <none>
```
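The manual recovery above as a command sequence; the `microk8s status --wait-ready` check and the final verification are additions for completeness, not part of the original report:

```bash
# On the affected worker node: restart the whole MicroK8s snap.
sudo snap stop microk8s
sudo snap start microk8s

# Optionally block until MicroK8s reports itself as ready again.
microk8s status --wait-ready

# From any control-plane node: confirm the node and its pods recovered.
kubectl get nodes
kubectl get pods -A -o wide | grep k8s-test-n1
```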
If restarting the affected node does not help, removing the node and adding it back to the cluster is the only thing that works.
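A sketch of that remove/re-add sequence, assuming the standard MicroK8s clustering commands; the hostname and the use of `--force` (for a node that is no longer reachable) are illustrative, and the join `<token>` is a placeholder:

```bash
# On the node being removed, if it is still reachable:
sudo microk8s leave

# On a control-plane node: drop the node from the cluster.
# --force is required when the node could not run `microk8s leave`.
microk8s remove-node k8s-test-n1 --force

# Generate a fresh join command on a control-plane node ...
microk8s add-node

# ... and join again from the removed node.
microk8s join 10.14.214.x:25000/<token> --worker
```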
Introspection Report
todo
Can you suggest a fix?
Not at this moment.
Are you interested in contributing with a fix?
Yes, I am very happy to test and to collaborate further on tracking down the problem.