Nodes not in ready state after certificate renewal on master node

Open nikhilbalekundargi opened this issue 1 year ago • 4 comments

I was not able to connect to the k3s cluster.

root:/etc/rancher/k3s# kubectl get pods --kubeconfig k3s.yaml
error: You must be logged in to the server (Unauthorized)

I then checked the certificates and renewed them following https://www.ibm.com/support/pages/node/6444205. That resolved the cluster access issue.

After the certificate renewal, two of the three nodes are in the NotReady state.

root@jump:~# kubectl get nodes --kubeconfig .kube/k3s-stg-config 
NAME        STATUS     ROLES                  AGE     VERSION
stg-vgw-2   NotReady   <none>                 2y48d   v1.21.1+k3s1
stg-vgw-3   NotReady   <none>                 2y48d   v1.21.1+k3s1
stg-vgw-1   Ready      control-plane,master   2y48d   v1.21.1+k3s1

Output of kubectl describe node for a NotReady node:

Conditions:
  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----                 ------    -----------------                 ------------------                ------              -------
  NetworkUnavailable   False     Mon, 07 Jun 2021 06:46:40 +0000   Mon, 07 Jun 2021 06:46:40 +0000   FlannelIsUp         Flannel is running on this node
  MemoryPressure       Unknown   Tue, 07 Jun 2022 06:42:06 +0000   Thu, 14 Jul 2022 06:46:14 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure         Unknown   Tue, 07 Jun 2022 06:42:06 +0000   Thu, 14 Jul 2022 06:46:14 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure          Unknown   Tue, 07 Jun 2022 06:42:06 +0000   Thu, 14 Jul 2022 06:46:14 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready                Unknown   Tue, 07 Jun 2022 06:42:06 +0000   Thu, 14 Jul 2022 06:46:14 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
Addresses:

I need help bringing the nodes back to the Ready state.

nikhilbalekundargi, Jul 14 '22 08:07

Just showing that there's a problem doesn't give us much to work with; some diagnostic information would be helpful. For example, what do the k3s-agent service logs on the NotReady nodes show? If a quick examination doesn't reveal anything useful, can you attach them to this issue?

Also, the third-party instructions that you followed to rotate the certificates appear to be for k3s v1.18; those steps should no longer be necessary and are not related to the problem you were experiencing.
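
A minimal sketch of how the requested logs could be gathered on a NotReady node, assuming k3s runs under systemd with the usual unit names (k3s-agent on agents, k3s on the server). The helper name collect_k3s_logs, the default time window, and the default output path are hypothetical and only for illustration.

```shell
#!/bin/sh
# Sketch: save recent journald logs for a k3s unit to a file for attaching to an issue.
# Assumptions: systemd/journald is in use; the unit is named k3s-agent (or k3s on the server).
collect_k3s_logs() {
    unit="${1:-k3s-agent}"                      # systemd unit to read
    since="${2:-1 hour ago}"                    # journalctl time window
    out="${3:-/tmp/${unit}-$(hostname).log}"    # hypothetical default output path
    # Write the unit's log entries to the file and print the file name on success.
    journalctl -u "$unit" --no-pager --since "$since" > "$out" && echo "$out"
}

# Example, run as root on a NotReady node:
#   collect_k3s_logs k3s-agent "2 hours ago"
```

The resulting file can be skimmed for certificate or authorization errors before it is attached.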

brandond, Jul 14 '22 17:07

I see similar behavior (worker "NotReady") with the following version:

Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.3+k3s1", GitCommit:"990ba0e88c90f8ed8b50e0ccd375937b841b176e", GitTreeState:"clean", BuildDate:"2022-07-19T01:10:03Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"linux/amd64"}

Each day, the worker loses connectivity with the master node. Could be certificate renewal, how can I check?
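
One possible way to check, sketched under the assumption that k3s uses its default data directory (/var/lib/rancher/k3s) and that openssl is installed on the node, is to print the expiry date of each certificate file. The function name cert_expiry is hypothetical.

```shell
#!/bin/sh
# Sketch: print the notAfter (expiry) date of every .crt file in a directory.
# Assumption: agent certificates live in /var/lib/rancher/k3s/agent (default data dir).
cert_expiry() {
    dir="${1:-/var/lib/rancher/k3s/agent}"
    for crt in "$dir"/*.crt; do
        [ -f "$crt" ] || continue               # skip if the glob matched nothing
        printf '%s: ' "$crt"
        # Reads the first certificate in the file and prints "notAfter=<date>".
        openssl x509 -noout -enddate -in "$crt"
    done
}

# Example, run as root:
#   cert_expiry /var/lib/rancher/k3s/agent        # on an agent node
#   cert_expiry /var/lib/rancher/k3s/server/tls   # on the server node
```

If every date is well in the future, expired certificates are unlikely to explain the lost connectivity.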

mihaigalos, Aug 04 '22 18:08

> Each day, the worker loses connectivity with the master node. Could be certificate renewal

@mihaigalos certificate renewal only happens when k3s is starting, not daily, so I would doubt that's related, unless you're restarting the k3s process every day.

Just showing what version you're running doesn't give us much to work with. Can you open a new issue, and attach k3s/k3s-agent journald logs from the nodes in question?

brandond, Aug 04 '22 18:08

> Just showing what version you're running doesn't give us much to work with. Can you open a new issue, and attach k3s/k3s-agent journald logs from the nodes in question?

I realize that now, sorry. As soon as I can reproduce it, I'll create an issue with logs. The cluster looks stable at the moment.

mihaigalos, Aug 05 '22 05:08

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

stale[bot], Feb 01 '23 07:02