
Node stays on Ready,SchedulingDisabled

Open bramvdklinkenberg opened this issue 5 years ago • 45 comments

We are using Kured on AKS and I regularly see that nodes stay in status Ready,SchedulingDisabled and I have to uncordon them manually. When I look at the log of the kured pod it shows:

time="2019-03-06T06:30:27Z" level=info msg="Kubernetes Reboot Daemon: 1.1.0"
time="2019-03-06T06:30:27Z" level=info msg="Node ID: aks-default-13951270-0"
time="2019-03-06T06:30:27Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
time="2019-03-06T06:30:27Z" level=info msg="Reboot Sentinel: /var/run/reboot-required every 1h0m0s"
time="2019-03-06T06:30:28Z" level=info msg="Holding lock"
time="2019-03-06T06:30:28Z" level=info msg="Uncordoning node aks-default-13951270-0"
time="2019-03-06T06:30:29Z" level=info msg="node/aks-default-13951270-0 uncordoned" cmd=/usr/bin/kubectl std=out
time="2019-03-06T06:30:29Z" level=info msg="Releasing lock"

So the log says it uncordoned the node, but I still regularly find that nodes are in fact not uncordoned. Is this something you guys see more often?

bramvdklinkenberg avatar Mar 11 '19 13:03 bramvdklinkenberg

I've encountered this twice this morning.

sylr avatar May 16 '19 08:05 sylr

@bramvdklinkenberg interesting - thanks for the report. @sylr are you on AKS too?

Kured simply runs kubectl uncordon $node - you can see from the output of that command above that it indicated success:

Uncordoning node aks-default-13951270-0
node/aks-default-13951270-0 uncordoned

The only thing that comes to mind is that there's some incompatibility between the version of kubectl in the kured images you're using and AKS - next time this happens, it would be interesting if you could report the following (commands to collect these are sketched below the list):

  1. The AKS kubernetes server version
  2. The kured image version
  3. The version of kubectl that you subsequently used to uncordon successfully
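
Something like this should surface all three (assuming the default DaemonSet name and namespace from the kured manifest - adjust if your install differs):

  # Kubernetes server (AKS) and local kubectl client versions
  kubectl version
  # Image used by the kured DaemonSet
  kubectl get daemonset kured -n kube-system -o jsonpath='{.spec.template.spec.containers[0].image}'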

awh avatar May 16 '19 13:05 awh

Hi @awh , I am no longer working at the customer where I encountered the issue. Maybe @sylr can provide the requested info

bramvdklinkenberg avatar May 17 '19 06:05 bramvdklinkenberg

I've got the same issue using AKS and the versions are as follows.

  1. Kubernetes version 1.13.5
  2. Kured image: weaveworks/kured:1.1.0
  3. kubectl client: Client Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0" ...}

ryanfernandes09 avatar Jun 05 '19 15:06 ryanfernandes09

Had an issue just like this, versions following.

  1. AKS Kubernetes server version is 1.12.7
  2. The kured image version is weaveworks/kured:1.2.0
  3. The working local kubectl version is version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-06T01:44:30Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"windows/amd64"}

adelisle avatar Aug 17 '19 03:08 adelisle

It will have this problem when there is only 1 node in AKS, because the pods cannot be re-created after the node has rebooted.

hermanho avatar Aug 24 '19 09:08 hermanho

@hermanho so how do we re-create the pods after a node reboot?

prabhakarreddy1234 avatar Jan 10 '20 23:01 prabhakarreddy1234

Encountered this problem on AKS as well. Had to adjust my PodDisruptionBudget before the drain would complete.

Details:

Kubernetes version: Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.7", GitCommit:"6c143d35bb11d74970e7bc0b6c45b6bfdffc0bd4", GitTreeState:"clean", BuildDate:"2019-12-11T12:34:17Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
kured version: docker.io/weaveworks/kured:1.2.0
kubectl version: Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.1", GitCommit:"d224476cd0730baca2b6e357d144171ed74192d6", GitTreeState:"clean", BuildDate:"2020-01-15T15:50:38Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"darwin/amd64"}

evenh avatar Jan 17 '20 09:01 evenh

Each time I get this, it is because the reboot cannot occur: a Pod Disruption Budget does not allow pods to be evicted from the node kured is trying to drain.
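
A quick way to check for that (a sketch - substitute your own namespace and budget name):

  # Budgets showing 0 allowed disruptions will block kured's drain
  kubectl get pdb --all-namespaces
  kubectl describe pdb <name> -n <namespace>   # look at "Allowed disruptions"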

sylr avatar Jan 17 '20 09:01 sylr

@hermanho so how do we re-create the pods after a node reboot?

kubectl uncordon <your node name>

hermanho avatar Jan 17 '20 11:01 hermanho

Hi guys, we just had this issue on one of our AKS clusters but we don't have any PDB configured...

  • AKS Version: v1.15.7
  • kured image: docker.io/weaveworks/kured:master-f6e4062

remiserriere avatar Mar 18 '20 16:03 remiserriere

We also encountered this issue in one of our 3-node AKS clusters.

AKS Version: v1.14.7 kured image: docker.io/weaveworks/kured:1.3.0

We also observed another behaviour: although the rebooted machine was still in the Ready,SchedulingDisabled state, kured continued on to patch the next machine.

ashwajce avatar Apr 09 '20 18:04 ashwajce

Hello, Same for us. AKS Version: v1.15.10 kured image: docker.io/weaveworks/kured:1.3.0

Subreptivus avatar Apr 10 '20 13:04 Subreptivus

Same here. No PDB, kured 1.3.0, 2-node AKS cluster, Kubernetes version 1.16.8. One node isn't uncordoned after the reboot.

found a timeout error in the log: error: unable to uncordon node ... Timeout: request did not complete within requested timeout 30s" cmd=/usr/bin/kubectl std=err

A lack of resources might have something to do with that.
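
In case it helps anyone hitting the same timeout, retrying the uncordon with a longer client-side request timeout is a possible workaround (a sketch - <node-name> is a placeholder):

  kubectl uncordon <node-name> --request-timeout=60s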

pvdulmen avatar Jun 08 '20 13:06 pvdulmen

We also had this problem: AKS kubernetes v1.15.10 kured image: docker.io/weaveworks/kured:1.3.0

jvassbo avatar Jun 17 '20 13:06 jvassbo

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

github-actions[bot] avatar Nov 27 '20 01:11 github-actions[bot]

Problem persists on AKS with K8S v1.16.15.

s-spindler avatar Dec 03 '20 10:12 s-spindler

https://github.com/weaveworks/kured/pull/283

flbla avatar Dec 29 '20 14:12 flbla

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

github-actions[bot] avatar Feb 28 '21 02:02 github-actions[bot]

Still having this/a similar issue. If a reboot/drain hangs because of a PDB and you manually delete the pod, the reboot is done but the node is still unschedulable (SchedulingDisabled).

details of node:

Events:
  Type     Reason                   Age  From        Message
  ----     ------                   ---- ----        -------
  Normal   Starting                 42s  kubelet     Starting kubelet.
  Normal   NodeAllocatableEnforced  42s  kubelet     Updated Node Allocatable limit across pods
  Normal   NodeHasSufficientMemory  42s  kubelet     Node xxx-xx00004c status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    42s  kubelet     Node xxx-xx00004c status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     42s  kubelet     Node xxx-xx00004c status is now: NodeHasSufficientPID
  Warning  Rebooted                 42s  kubelet     Node xxx-xx00004c has been rebooted, boot id: xxx-xxx-xxx-xxx-xxx
  Normal   NodeNotReady             42s  kubelet     Node xxx-xx00004c status is now: NodeNotReady
  Normal   NodeNotSchedulable       42s  kubelet     Node xxx-xx00004c status is now: NodeNotSchedulable
  Normal   Starting                 36s  kube-proxy  Starting kube-proxy.
  Normal   NodeReady                32s  kubelet     Node xxx-xx00004c status is now: NodeReady

Taints:        node.kubernetes.io/unschedulable:NoSchedule
Unschedulable: true

So in the end the node is still NotSchedulable.
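
What I do to verify and clear it manually (a sketch - <node-name> is a placeholder):

  # confirm the node is still cordoned
  kubectl get node <node-name> -o jsonpath='{.spec.unschedulable}'
  # uncordon clears spec.unschedulable and the node.kubernetes.io/unschedulable taint
  kubectl uncordon <node-name>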

pvdulmen avatar Mar 01 '21 08:03 pvdulmen

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

github-actions[bot] avatar May 01 '21 01:05 github-actions[bot]

We are still having this issue, any progress here? Thanks!

MKruger777 avatar May 17 '21 10:05 MKruger777

Issue still exists, would be great that this is fixed.

avwsolutions avatar Jun 30 '21 07:06 avwsolutions

Running into the issue again at another customer, @awh. Kubectl: v1.21.2, AKS: v1.19.11, Kured: 1.6.1.

bramvandenklinkenberg avatar Jun 30 '21 07:06 bramvandenklinkenberg

I think, on top of the versions used, it would be good to know a bit more about the environment. What CLI arguments do you pass to kured? Did you wait for the full period? What is the sentinel reporting?

evrardjp avatar Jul 28 '21 07:07 evrardjp

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

github-actions[bot] avatar Sep 28 '21 01:09 github-actions[bot]

Having the same issue on AKS. AKS: v1.20.5, Kured: 1.7.

GeorgHoffmeyer avatar Oct 06 '21 06:10 GeorgHoffmeyer

Encountered the same issue. AKS: 1.21.2 Kured: 1.8.1

KristapsT avatar Dec 07 '21 09:12 KristapsT

Is anyone experiencing this who isn't using PodDisruptionBudget configurations?

Something like this would be in the kured logs:

"error when evicting pods/"coredns-58567c6d46-sdctb" -n "kube-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

jackfrancis avatar Dec 15 '21 21:12 jackfrancis

Since @hermanho mentioned single node clusters: kured won't work on a single-node AKS cluster because AKS sets a PDB on coredns.

There are other cases where PDBs will block too, like if one is configured to allow no disruptions (maxUnavailable=0 || minAvailable==replicas), or if pods are actually not coming back up to ready after eviction. But then you would see the "error when evicting pods" message in the kured logs mentioned above.

We're less clear on why this might happen when it's not blocking on eviction/PDB.
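
For anyone debugging this, a rough way to spot and (where it is safe for the workload) relax a blocking budget - the names below are placeholders, and note that AKS may reconcile changes to its managed coredns PDB:

  # a budget blocks eviction whenever ALLOWED DISRUPTIONS shows 0
  kubectl get pdb -n kube-system
  # relaxing the budget unblocks the drain
  kubectl patch pdb <pdb-name> -n <namespace> --type merge -p '{"spec":{"maxUnavailable":1}}'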

paulgmiller avatar Dec 15 '21 21:12 paulgmiller