
Cannot add agent node with a previously used hostname (that was deleted)

Open · binaryn3xus opened this issue 3 years ago • 10 comments

Environmental Info: K3s Version: v1.24.3+k3s1

Node(s) CPU architecture, OS, and Version: Linux fleetcom-node4 5.15.0-46-generic #49-Ubuntu SMP Thu Aug 4 18:03:25 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: 3 servers, 1 agent (trying to add back one more agent)

Describe the bug: I am trying to add back a node, using the same hostname as before, that I removed in order to upgrade its hardware. It got a brand new hard drive and more RAM (if that matters). I reinstalled Ubuntu 22.04, the same version it ran before, and then tried to rejoin it to the cluster. The install script hangs at [INFO] systemd: Starting k3s-agent, and systemctl status k3s-agent shows this message repeating:

Aug 30 14:00:20 fleetcom-node4 k3s[1697]: time="2022-08-30T14:00:20Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"

Steps To Reproduce:

  1. Get the server token (sudo cat /var/lib/rancher/k3s/server/node-token on one of the server nodes)
  2. Run: curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.24.3+k3s1" K3S_URL=https://10.5.0.3:6443 K3S_TOKEN="<NODE-TOKEN>" sh -
  3. The script runs and then hangs at systemd: Starting k3s-agent, as described in the 'Describe the bug' section above.
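
(As an aside, the --with-node-id flag that the error message suggests can be passed through the install script via INSTALL_K3S_EXEC; a sketch, assuming the same URL and token as above. I have not gone this route, since I want to keep the original hostname.)

# hypothetical re-run of the install with a unique node-name suffix enabled
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.24.3+k3s1" \
  K3S_URL=https://10.5.0.3:6443 K3S_TOKEN="<NODE-TOKEN>" \
  INSTALL_K3S_EXEC="--with-node-id" sh -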

Expected behavior:

I am trying to add this node back with the same hostname it had before I removed it. I would expect it to rejoin without issue, since I drained the node and deleted it from the cluster before swapping out the SSD.

Actual behavior:

Getting this log message repeatedly:

Aug 30 14:00:20 fleetcom-node4 k3s[1697]: time="2022-08-30T14:00:20Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"

Additional context / logs:

Before changing out the hardware, I ran a drain command and then kubectl delete node fleetcom-node4.
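
Roughly this sequence (the exact drain flags are from memory, so treat them as illustrative):

# cordon and evict workloads, then remove the node object
kubectl drain fleetcom-node4 --ignore-daemonsets --delete-emptydir-data
kubectl delete node fleetcom-node4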

binaryn3xus avatar Aug 30 '22 14:08 binaryn3xus

Can you confirm that the fleetcom-node4 node was successfully deleted from the cluster? The error message you're receiving indicates that it is still there.

brandond avatar Aug 30 '22 19:08 brandond

> Can you confirm that the fleetcom-node4 node was successfully deleted from the cluster? The error message you're receiving indicates that it is still there.

> kubectl get nodes
NAME             STATUS     ROLES                       AGE    VERSION
fleetcom-node1   Ready      control-plane,etcd,master   21d    v1.24.3+k3s1
fleetcom-node2   Ready      control-plane,etcd,master   21d    v1.24.3+k3s1
fleetcom-node3   Ready      control-plane,etcd,master   21d    v1.24.3+k3s1
fleetcom-node5   NotReady   <none>                      2d1h   v1.24.3+k3s1

(Node5 is a VM that I have powered off at the moment)

binaryn3xus avatar Aug 30 '22 19:08 binaryn3xus

Can you try kubectl get secret -n kube-system fleetcom-node4.node-password.k3s ? This secret should have been deleted when you removed the node from the cluster; if it still remains then delete it and try registering the node again.
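
If it does turn out to still exist, something along these lines should clear it (assuming the hostname is fleetcom-node4):

# remove the stale node-password secret so the agent can re-register
kubectl delete secret -n kube-system fleetcom-node4.node-password.k3s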

brandond avatar Aug 30 '22 20:08 brandond

> kubectl get secret -n kube-system fleetcom-node4.node-password.k3s
Error from server (NotFound): secrets "fleetcom-node4.node-password.k3s" not found

> kubectl get secret -n kube-system
NAME                               TYPE                DATA   AGE
fleetcom-node1.node-password.k3s   Opaque              1      21d
fleetcom-node2.node-password.k3s   Opaque              1      21d
fleetcom-node3.node-password.k3s   Opaque              1      21d
fleetcom-node5.node-password.k3s   Opaque              1      2d1h
k3s-serving                        kubernetes.io/tls   2      21d

binaryn3xus avatar Aug 30 '22 20:08 binaryn3xus

And the node still will not successfully join the cluster? Do you see anything in the logs on the servers?

brandond avatar Aug 30 '22 21:08 brandond

Correct. I have looked for anything that might be helpful, but maybe I am looking in the wrong places; suggestions? So far I have found no clues as to why I can't add it.

binaryn3xus avatar Aug 30 '22 22:08 binaryn3xus

Can you attach the full logs from all 4 nodes? The 3 active servers and the node you're trying to join?
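
If these are standard systemd installs, something like this should capture them (output paths are just suggestions):

# on each server
journalctl -u k3s --no-pager > k3s-server-$(hostname).log
# on the node that is failing to join
journalctl -u k3s-agent --no-pager > k3s-agent-$(hostname).log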

brandond avatar Aug 31 '22 01:08 brandond

The files were large, so you can get them from my Google Drive: k3s-logs

I had to get the service logs, since there were no others for the node I was trying to add.

binaryn3xus avatar Aug 31 '22 02:08 binaryn3xus

I wasn't able to find anything useful in the logs. In particular, the logs on the servers stop on August 29th, while the only logs from the node that's failing to join are from the 31st, which makes it hard to correlate anything.

If you're still fighting this, have you tried removing /etc/rancher/node/password from the node before rejoining it?
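
A minimal sketch of that reset, assuming a standard systemd agent install:

# clear the cached node password and let the agent re-register
sudo rm /etc/rancher/node/password
sudo systemctl restart k3s-agent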

brandond avatar Sep 10 '22 00:09 brandond

@brandond I did try that. Honestly, though, this gave me the kick in the pants I needed to set up something like Flux. I also decided to go the route of appending the node ID to the end of the hostname, which should help minimize issues like this one.
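
(For anyone finding this later, a sketch of how I am enabling that on the agent via its config file; the same flag can equally be passed on the k3s agent command line:)

# /etc/rancher/k3s/config.yaml on the agent
with-node-id: true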

So this issue can be closed unless it is something the project wants to continue to investigate.

binaryn3xus avatar Sep 10 '22 00:09 binaryn3xus