
When worker or master nodes get shut down, they stay stuck at NotReady

Aaron-Ritter opened this issue 7 months ago · 2 comments

Summary

When stopping (restarting) a node on 1.30, we regularly hit the issue that it does not become Ready again in the cluster.

microk8s inspect shows FAIL: Service snap.microk8s.daemon-kubelite is not running
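
For reference, that failure is reported by the standard inspection command run on the affected node:

microk8s inspect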

What Should Happen Instead?

The node should come online without issues.

Reproduction Steps

Reproduction is not consistent, so the problem seems related either to the startup of the node or to the shutdown that preceded it.

  1. Set up 8 Debian 12 cloud-image VMs.
  2. Create an HA cluster with 4 master and 4 worker nodes.
  3. Shut down one of the nodes, start it again, and check whether the node becomes Ready again (see the commands and output below).
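
The shutdown itself is a plain OS shutdown on the node (starting the VM again happens on the hypervisor side and is not shown); node status is then checked from a working node, as below:

# on the node being tested
sudo shutdown -h now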
kubectl get nodes
NAME          STATUS     ROLES    AGE   VERSION
k8s-test-m1   Ready      <none>   79d   v1.30.1
k8s-test-m2   Ready      <none>   79d   v1.30.1
k8s-test-m3   Ready      <none>   79d   v1.30.1
k8s-test-m4   Ready      <none>   79d   v1.30.1
k8s-test-n1   NotReady   <none>   39h   v1.30.1
k8s-test-n2   NotReady   <none>   39h   v1.30.1
k8s-test-n3   Ready      <none>   69d   v1.30.1
k8s-test-n4   Ready      <none>   39h   v1.30.1

On both nodes, the Kubernetes-related pods simply remain in Running status, while all application pods are stuck in Terminating.
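
The listing below was taken cluster-wide in wide format and filtered to the two affected nodes; something along these lines reproduces it:

kubectl get pods -A -o wide | grep -E 'k8s-test-n1|k8s-test-n2'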

ceph-csi-cephfs               ceph-csi-cephfs-nodeplugin-npz5d                 3/3     Running             0              39h     10.14.214.41   k8s-test-n1   <none>           <none>
ceph-csi-rbd                  ceph-csi-rbd-nodeplugin-wvjdn                    3/3     Running             0              39h     10.14.214.41   k8s-test-n1   <none>           <none>
kube-system                   calico-node-7qbzl                                1/1     Running             0              39h     10.14.214.41   k8s-test-n1   <none>           <none>
metallb-system                speaker-82qn4                                    1/1     Running             0              39h     10.14.214.41   k8s-test-n1   <none>           <none>
ceph-csi-cephfs               ceph-csi-cephfs-nodeplugin-hjlwq                 3/3     Running             0              39h     10.14.214.42   k8s-test-n2   <none>           <none>
ceph-csi-rbd                  ceph-csi-rbd-nodeplugin-zbnmp                    3/3     Running             0              39h     10.14.214.42   k8s-test-n2   <none>           <none>
kube-system                   calico-node-qrdt2                                1/1     Running             0              39h     10.14.214.42   k8s-test-n2   <none>           <none>
metallb-system                speaker-p74qt                                    1/1     Running             0              39h     10.14.214.42   k8s-test-n2   <none>           <none>
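
The failed kubelite unit on k8s-test-n1, as reported by systemd (status and journal collected with the usual systemd tools):

sudo systemctl status snap.microk8s.daemon-kubelite.service
sudo journalctl -u snap.microk8s.daemon-kubelite.service -b --no-pager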
× snap.microk8s.daemon-kubelite.service - Service for snap application microk8s.daemon-kubelite
     Loaded: loaded (/etc/systemd/system/snap.microk8s.daemon-kubelite.service; enabled; preset: enabled)
    Drop-In: /etc/systemd/system/snap.microk8s.daemon-kubelite.service.d
             └─delegate.conf
     Active: failed (Result: exit-code) since Fri 2024-07-19 13:10:13 UTC; 13min ago
   Duration: 426ms
    Process: 1860 ExecStart=/usr/bin/snap run microk8s.daemon-kubelite (code=exited, status=255/EXCEPTION)
   Main PID: 1860 (code=exited, status=255/EXCEPTION)
        CPU: 467ms

Jul 19 13:10:13 k8s-test-n1 systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 5.
Jul 19 13:10:13 k8s-test-n1 systemd[1]: Stopped snap.microk8s.daemon-kubelite.service - Service for snap application microk8s.daemon-kubelite.
Jul 19 13:10:13 k8s-test-n1 systemd[1]: snap.microk8s.daemon-kubelite.service: Start request repeated too quickly.
Jul 19 13:10:13 k8s-test-n1 systemd[1]: snap.microk8s.daemon-kubelite.service: Failed with result 'exit-code'.
Jul 19 13:10:13 k8s-test-n1 systemd[1]: Failed to start snap.microk8s.daemon-kubelite.service - Service for snap application microk8s.daemon-kubelite.

snap.microk8s.daemon-kubelite.service.txt

After manually restarting microk8s on the affected worker nodes with sudo snap stop microk8s and sudo snap start microk8s, they recovered and reconnected:
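
That is, on the affected node:

sudo snap stop microk8s
sudo snap start microk8s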

ceph-csi-cephfs               ceph-csi-cephfs-nodeplugin-npz5d                 3/3     Running   3 (14m ago)    39h     10.14.214.41   k8s-test-n1   <none>           <none>
ceph-csi-rbd                  ceph-csi-rbd-nodeplugin-wvjdn                    3/3     Running   3 (14m ago)    39h     10.14.214.41   k8s-test-n1   <none>           <none>
kube-system                   calico-node-7qbzl                                1/1     Running   1 (14m ago)    39h     10.14.214.41   k8s-test-n1   <none>           <none>
metallb-system                speaker-82qn4                                    1/1     Running   1 (14m ago)    39h     10.14.214.41   k8s-test-n1   <none>           <none>
ceph-csi-cephfs               ceph-csi-cephfs-nodeplugin-hjlwq                 3/3     Running   3 (14m ago)    39h     10.14.214.42   k8s-test-n2   <none>           <none>
ceph-csi-rbd                  ceph-csi-rbd-nodeplugin-zbnmp                    3/3     Running   3 (14m ago)    39h     10.14.214.42   k8s-test-n2   <none>           <none>
kube-system                   calico-node-qrdt2                                1/1     Running   1 (14m ago)    39h     10.14.214.42   k8s-test-n2   <none>           <none>
metallb-system                speaker-p74qt                                    1/1     Running   1 (14m ago)    39h     10.14.214.42   k8s-test-n2   <none>           <none>

If restarting the affected node does not help, removing the node and adding it again is the only thing that works.
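
For completeness, a typical remove/re-add cycle uses the standard microk8s commands (the join string comes from the add-node output; --worker applies when re-joining a worker node):

# on the stuck node
sudo microk8s leave

# on one of the remaining control plane nodes
sudo microk8s remove-node k8s-test-n1
sudo microk8s add-node

# back on the stuck node, using the join command printed by add-node
sudo microk8s join <ip>:25000/<token> --worker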

Introspection Report

todo

Can you suggest a fix?

not at this moment

Are you interested in contributing with a fix?

Yes, very happy to test and collaborate further on tracking down the problem.

Aaron-Ritter · Jul 19 '24 14:07