k3s
Cannot add agent node with a previously used hostname (that was deleted)
Environmental Info: K3s Version: v1.24.3+k3s1
Node(s) CPU architecture, OS, and Version: Linux fleetcom-node4 5.15.0-46-generic #49-Ubuntu SMP Thu Aug 4 18:03:25 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration: 3 servers, 1 agent (trying to add back one more agent)
Describe the bug:
Trying to add back a node, with the same hostname as before, that I removed in order to upgrade its hardware. It got a brand new hard drive and more RAM (if that matters). I reinstalled Ubuntu 22.04, which is the version it was on before, and then proceeded to try to add it back to the cluster. While running the install script, it hangs at [INFO] systemd: Starting k3s-agent. Upon running systemctl status k3s-agent, I can see a log line being repeated: Aug 30 14:00:20 fleetcom-node4 k3s[1697]: time="2022-08-30T14:00:20Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
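For reference, a minimal way to keep watching that repeating message (assuming the default k3s-agent unit name created by the install script):
> sudo journalctl -u k3s-agent -f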
Steps To Reproduce:
- Get the server token (using sudo cat /var/lib/rancher/k3s/server/node-token on a master node)
- Run the command:
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.24.3+k3s1" K3S_URL=https://10.5.0.3:6443 K3S_TOKEN="<NODE-TOKEN>" sh -
- It runs and then hangs at systemd: Starting k3s-agent, as stated in the 'Describe the bug' section above.
Expected behavior:
I am trying to add this node back with the same hostname as before I removed it. I would expect it to add back without issue since I drained the node and deleted it from the cluster before changing out the SSD.
Actual behavior:
Getting this log message: Aug 30 14:00:20 fleetcom-node4 k3s[1697]: time="2022-08-30T14:00:20Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
Additional context / logs:
Before changing out the hardware, I ran a drain command and then kubectl delete fleetcom-node4.
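The removal sequence was roughly the following (the exact drain flags are from memory, so treat them as an approximation):
> kubectl drain fleetcom-node4 --ignore-daemonsets --delete-emptydir-data
> kubectl delete node fleetcom-node4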
Can you confirm that the fleetcom-node4 node was successfully deleted from the cluster? The error message you're receiving indicates that it is still there.
> kubectl get nodes
NAME STATUS ROLES AGE VERSION
fleetcom-node1 Ready control-plane,etcd,master 21d v1.24.3+k3s1
fleetcom-node2 Ready control-plane,etcd,master 21d v1.24.3+k3s1
fleetcom-node3 Ready control-plane,etcd,master 21d v1.24.3+k3s1
fleetcom-node5 NotReady <none> 2d1h v1.24.3+k3s1
(Node5 is a VM that I have powered off at the moment)
Can you try kubectl get secret -n kube-system fleetcom-node4.node-password.k3s ? This secret should have been deleted when you removed the node from the cluster; if it still remains then delete it and try registering the node again.
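If the secret does still exist, deleting it should look roughly like this (substituting your own hostname for fleetcom-node4):
> kubectl delete secret -n kube-system fleetcom-node4.node-password.k3s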
> kubectl get secret -n kube-system fleetcom-node4.node-password.k3s
Error from server (NotFound): secrets "fleetcom-node4.node-password.k3s" not found
> kubectl get secret -n kube-system
NAME TYPE DATA AGE
fleetcom-node1.node-password.k3s Opaque 1 21d
fleetcom-node2.node-password.k3s Opaque 1 21d
fleetcom-node3.node-password.k3s Opaque 1 21d
fleetcom-node5.node-password.k3s Opaque 1 2d1h
k3s-serving kubernetes.io/tls 2 21d
And the node still will not successfully join the cluster? Do you see anything in the logs on the servers?
Correct. I have looked for anything that might be helpful, but maybe I am looking in the wrong places. Suggestions? So far I have found no clues as to why I can't add it.
Can you attach the full logs from all 4 nodes? The 3 active servers and the node you're trying to join?
The files were large, so you can get them from my Google Drive - k3s-logs
I had to pull the service logs for the node I was trying to add, since there were no k3s log files on it.
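In case it helps anyone gathering the same logs, the commands were roughly these (unit names assume the defaults created by the install script):
On each server: sudo journalctl -u k3s > k3s-server.log
On the agent: sudo journalctl -u k3s-agent > k3s-agent.log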
I wasn't able to find anything useful in the logs. In particular, I noticed that the logs on the servers stop at August 29th, while the only logs from the node that's failing to join are from the 31st. Makes it kind of hard to correlate anything.
If you're still fighting this, have you tried removing /etc/rancher/node/password from the node before rejoining it?
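Something along these lines, in case it isn't obvious (paths and unit name assume a default k3s agent install):
> sudo systemctl stop k3s-agent
> sudo rm /etc/rancher/node/password
> sudo systemctl start k3s-agent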
@brandond I did try that. But honestly, this gave me the kick in the pants I needed to set up something like Flux. I also decided to just go the route of adding the node ID to the end of the name, which should help minimize issues like this in the future.
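For anyone else going this route, my understanding is that the flag can be passed through the install script roughly like this (the INSTALL_K3S_EXEC part is based on my reading of the install script, so double-check it):
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.24.3+k3s1" K3S_URL=https://10.5.0.3:6443 K3S_TOKEN="<NODE-TOKEN>" INSTALL_K3S_EXEC="--with-node-id" sh -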
So this issue can be closed unless it is something the project wants to continue to investigate.