
[Question] Kubernetes won't start after removing first node

Open Iliasb opened this issue 2 years ago • 3 comments

Hello everyone,

I recently had to remove the first node I added to our Harvester cluster because of a hardware failure. I was able to put the node in maintenance mode before removing it from the dashboard.

After I removed the node in the dashboard, the cluster went down and the VIP address became unavailable.

When I log on to the running nodes it seems that Kubernetes is also down.

 systemctl status rke2-server.service
● rke2-server.service - Rancher Kubernetes Engine v2 (server)
     Loaded: loaded (/usr/local/lib/systemd/system/rke2-server.service; disabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/rke2-server.service.d
             └─override.conf
     Active: activating (auto-restart) (Result: exit-code) since Wed 2022-07-27 15:15:57 UTC; 742ms ago
       Docs: https://github.com/rancher/rke2#readme
    Process: 24598 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited>
    Process: 24607 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 24608 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
    Process: 24609 ExecStartPre=/usr/sbin/harv-update-rke2-server-url server (code=exited, status=0/SUCCESS)
    Process: 24611 ExecStart=/usr/local/bin/rke2 server (code=exited, status=1/FAILURE)
    Process: 24632 ExecStopPost=/bin/sh -c systemd-cgls /system.slice/rke2-server.service | grep -Eo '[0-9]+ (container>
   Main PID: 24611 (code=exited, status=1/FAILURE)

kubectl get vm -n harvester-system
W0727 15:15:34.367736   24316 loader.go:221] Config not found: /etc/rancher/rke2/rke2.yaml

What is the best way to debug this? I'm a bit stuck here.

Thanks

Iliasb avatar Jul 27 '22 15:07 Iliasb

@Iliasb how many nodes did you have in your cluster before you removed the first node?

ibrokethecloud avatar Jul 28 '22 00:07 ibrokethecloud

> @Iliasb how many nodes did you have in your cluster before you removed the first node?

4 Nodes

Iliasb avatar Jul 29 '22 08:07 Iliasb

Found the issue.
etcdserver/api/etcdhttp: /health error; no leader (status code 503)

How can I select another node as master?
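
An error like the one above would typically show up in the logs of the rke2-server unit. The following is only a minimal sketch of how such a line could be spotted, assuming the node uses systemd's journal; `has_no_leader` is a hypothetical helper name, not something shipped with RKE2 or Harvester, and the match pattern is taken from the error text quoted above.

```shell
#!/bin/sh
# Sketch: detect the etcd "no leader" health error in log text.
# has_no_leader is a hypothetical helper, not part of rke2 or Harvester.
has_no_leader() {
    # Reads log text on stdin; exits 0 if any line reports a missing etcd leader.
    grep -q 'etcdserver.*no leader'
}

# Assumed usage on a node (needs journalctl and enough privileges to read the unit's logs):
#   journalctl -u rke2-server.service --no-pager | has_no_leader && echo "etcd has no leader"
```

This only confirms the symptom; it does not repair the etcd cluster or choose a new leader.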

Iliasb avatar Jul 29 '22 10:07 Iliasb

Hi @Iliasb, thanks for filing an issue here. Do you remember whether your cluster had 3 control plane nodes? If so, you may have hit known issue #2191. You can try the workaround in this thread: https://github.com/harvester/harvester/issues/2191#issuecomment-1115794201. Thank you.

FrankYang0529 avatar Aug 17 '22 08:08 FrankYang0529

This issue has not been updated recently, and the likely root cause has been identified and fixed. Closing now.

Feel free to reopen, thanks.

w13915984028 avatar Oct 25 '22 10:10 w13915984028