
Node stays on Ready,SchedulingDisabled

Open bramvdklinkenberg opened this issue 5 years ago • 45 comments

We are using Kured on AKS and I regularly see that nodes stay in status Ready,SchedulingDisabled and I have to uncordon them manually. When I look at the log of the kured pod it shows:

time="2019-03-06T06:30:27Z" level=info msg="Kubernetes Reboot Daemon: 1.1.0"
time="2019-03-06T06:30:27Z" level=info msg="Node ID: aks-default-13951270-0"
time="2019-03-06T06:30:27Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
time="2019-03-06T06:30:27Z" level=info msg="Reboot Sentinel: /var/run/reboot-required every 1h0m0s"
time="2019-03-06T06:30:28Z" level=info msg="Holding lock"
time="2019-03-06T06:30:28Z" level=info msg="Uncordoning node aks-default-13951270-0"
time="2019-03-06T06:30:29Z" level=info msg="node/aks-default-13951270-0 uncordoned" cmd=/usr/bin/kubectl std=out
time="2019-03-06T06:30:29Z" level=info msg="Releasing lock"

So the log says it uncordoned the node, but I still regularly find that nodes are in fact not uncordoned. Is this something you guys see more often?

bramvdklinkenberg avatar Mar 11 '19 13:03 bramvdklinkenberg

I've encountered this twice this morning.

sylr avatar May 16 '19 08:05 sylr

@bramvdklinkenberg interesting - thanks for the report. @sylr are you on AKS too?

Kured simply runs kubectl uncordon $node - you can see from the output of that command above that it indicated success:

Uncordoning node aks-default-13951270-0
node/aks-default-13951270-0 uncordoned

The only thing that comes to mind is that there's some incompatibility between the version of kubectl in the kured images you're using and AKS - next time this happens, it would be interesting if you could report the following (commands to collect these are sketched below the list):

  1. The AKS kubernetes server version
  2. The kured image version
  3. The version of kubectl that you subsequently used to uncordon successfully
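
Something like this should surface all three (assuming the default DaemonSet name and namespace from the kured manifest - adjust if your install differs):

  # Kubernetes server (AKS) and local kubectl client versions
  kubectl version
  # Image used by the kured DaemonSet
  kubectl get daemonset kured -n kube-system -o jsonpath='{.spec.template.spec.containers[0].image}'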

awh avatar May 16 '19 13:05 awh

Hi @awh , I am no longer working at the customer where I encountered the issue. Maybe @sylr can provide the requested info

bramvdklinkenberg avatar May 17 '19 06:05 bramvdklinkenberg

I've got the same issue using AKS and the versions are as follows.

  1. Kubernetes version 1.13.5
  2. Kured image: weaveworks/kured:1.1.0
  3. kubectl client: Client Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0" ...}

ryanfernandes09 avatar Jun 05 '19 15:06 ryanfernandes09

Had an issue just like this, versions following.

  1. AKS Kubernetes server version is 1.12.7
  2. The kured image version is weaveworks/kured:1.2.0
  3. The working local kubectl version is version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-06T01:44:30Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"windows/amd64"}

adelisle avatar Aug 17 '19 03:08 adelisle

It will have this problem when there is only 1 node in AKS, because the pods cannot be re-created after the node has rebooted.

hermanho avatar Aug 24 '19 09:08 hermanho

@hermanho so how do we re-create the pods after a node reboot?

prabhakarreddy1234 avatar Jan 10 '20 23:01 prabhakarreddy1234

Encountered this problem on AKS as well. Had to adjust my PodDisruptionBudget before the drain would complete.

Details:

Kubernetes version: Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.7", GitCommit:"6c143d35bb11d74970e7bc0b6c45b6bfdffc0bd4", GitTreeState:"clean", BuildDate:"2019-12-11T12:34:17Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
kured version: docker.io/weaveworks/kured:1.2.0
kubectl version: Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.1", GitCommit:"d224476cd0730baca2b6e357d144171ed74192d6", GitTreeState:"clean", BuildDate:"2020-01-15T15:50:38Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"darwin/amd64"}

evenh avatar Jan 17 '20 09:01 evenh

Each time I get this, it is because the reboot cannot occur: a Pod Disruption Budget does not allow pods to be evicted from the node kured is trying to drain.
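
A quick way to check for that (a sketch - substitute your own namespace and budget name):

  # Budgets showing 0 allowed disruptions will block kured's drain
  kubectl get pdb --all-namespaces
  kubectl describe pdb <name> -n <namespace>   # look at "Allowed disruptions"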

sylr avatar Jan 17 '20 09:01 sylr

@hermanho so how do we re-create the pods after a node reboot?

kubectl uncordon <your node name>

hermanho avatar Jan 17 '20 11:01 hermanho

Hi guys, we just had this issue on one of our AKS clusters but we don't have any PDB configured...

  • AKS Version: v1.15.7
  • kured image: docker.io/weaveworks/kured:master-f6e4062

remiserriere avatar Mar 18 '20 16:03 remiserriere

We also encountered this issue in one of our 3-node AKS clusters.

AKS Version: v1.14.7 kured image: docker.io/weaveworks/kured:1.3.0

We also observed another behaviour: although the rebooted machine was still in the Ready,SchedulingDisabled state, kured continued on to patch the next machine.

ashwajce avatar Apr 09 '20 18:04 ashwajce

Hello, Same for us. AKS Version: v1.15.10 kured image: docker.io/weaveworks/kured:1.3.0

Subreptivus avatar Apr 10 '20 13:04 Subreptivus

Same here. No PDB, kured 1.3.0, 2-node AKS cluster, Kubernetes version 1.16.8. One node isn't uncordoned after the reboot.

found a timeout error in the log: error: unable to uncordon node ... Timeout: request did not complete within requested timeout 30s" cmd=/usr/bin/kubectl std=err

A lack of resources might have something to do with that.
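
In case it helps anyone hitting the same timeout, retrying the uncordon with a longer client-side request timeout is a possible workaround (a sketch - <node-name> is a placeholder):

  kubectl uncordon <node-name> --request-timeout=60s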

pvdulmen avatar Jun 08 '20 13:06 pvdulmen

We also had this problem: AKS kubernetes v1.15.10 kured image: docker.io/weaveworks/kured:1.3.0

jvassbo avatar Jun 17 '20 13:06 jvassbo

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

github-actions[bot] avatar Nov 27 '20 01:11 github-actions[bot]

Problem persists on AKS with K8S v1.16.15.

s-spindler avatar Dec 03 '20 10:12 s-spindler

https://github.com/weaveworks/kured/pull/283

flbla avatar Dec 29 '20 14:12 flbla

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

github-actions[bot] avatar Feb 28 '21 02:02 github-actions[bot]

Still having this/a similar issue. If a reboot/drain hangs because of a PDB and you manually delete the pod, the reboot is done but the node is still unschedulable (SchedulingDisabled).

details of node:

Events:
  Type     Reason                   Age  From        Message
  ----     ------                   ---- ----        -------
  Normal   Starting                 42s  kubelet     Starting kubelet.
  Normal   NodeAllocatableEnforced  42s  kubelet     Updated Node Allocatable limit across pods
  Normal   NodeHasSufficientMemory  42s  kubelet     Node xxx-xx00004c status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    42s  kubelet     Node xxx-xx00004c status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     42s  kubelet     Node xxx-xx00004c status is now: NodeHasSufficientPID
  Warning  Rebooted                 42s  kubelet     Node xxx-xx00004c has been rebooted, boot id: xxx-xxx-xxx-xxx-xxx
  Normal   NodeNotReady             42s  kubelet     Node xxx-xx00004c status is now: NodeNotReady
  Normal   NodeNotSchedulable       42s  kubelet     Node xxx-xx00004c status is now: NodeNotSchedulable
  Normal   Starting                 36s  kube-proxy  Starting kube-proxy.
  Normal   NodeReady                32s  kubelet     Node xxx-xx00004c status is now: NodeReady

Taints:        node.kubernetes.io/unschedulable:NoSchedule
Unschedulable: true

So in the end the node is still NotSchedulable.
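
What I do to verify and clear it manually (a sketch - <node-name> is a placeholder):

  # confirm the node is still cordoned
  kubectl get node <node-name> -o jsonpath='{.spec.unschedulable}'
  # uncordon clears spec.unschedulable and the node.kubernetes.io/unschedulable taint
  kubectl uncordon <node-name>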

pvdulmen avatar Mar 01 '21 08:03 pvdulmen

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

github-actions[bot] avatar May 01 '21 01:05 github-actions[bot]

We are still having this issue, any progress here? Thanks!

MKruger777 avatar May 17 '21 10:05 MKruger777

Issue still exists, would be great that this is fixed.

avwsolutions avatar Jun 30 '21 07:06 avwsolutions

Running into the issue again at another customer, @awh. Kubectl: v1.21.2, AKS: v1.19.11, Kured: 1.6.1.

bramvandenklinkenberg avatar Jun 30 '21 07:06 bramvandenklinkenberg

I think, on top of the versions used, it would be good to know a bit more about the environment. What CLI arguments do you pass to kured? Did you wait for the full period? What is the sentinel reporting?

evrardjp avatar Jul 28 '21 07:07 evrardjp

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

github-actions[bot] avatar Sep 28 '21 01:09 github-actions[bot]

Having the same issue on AKS. AKS: v1.20.5, Kured: 1.7.

GeorgHoffmeyer avatar Oct 06 '21 06:10 GeorgHoffmeyer

Encountered the same issue. AKS: 1.21.2 Kured: 1.8.1

KristapsT avatar Dec 07 '21 09:12 KristapsT

Is anyone experiencing this who isn't using PodDisruptionBudget configurations?

Something like this would be in the kured logs:

"error when evicting pods/"coredns-58567c6d46-sdctb" -n "kube-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

jackfrancis avatar Dec 15 '21 21:12 jackfrancis

Since @hermanho mentioned single node clusters: kured won't work on a single-node AKS cluster because AKS sets a PDB on coredns.

There are other cases where PDBs will block too, like if one is configured to allow no disruptions (maxUnavailable=0 || minAvailable==replicas), or if pods are actually not coming back up to ready after eviction. But then you would see the "error when evicting pods" message in the kured logs mentioned above.

We're less clear on why this might happen when it's not blocking on eviction/PDB.
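
For anyone debugging this, a rough way to spot and (where it is safe for the workload) relax a blocking budget - the names below are placeholders, and note that AKS may reconcile changes to its managed coredns PDB:

  # a budget blocks eviction whenever ALLOWED DISRUPTIONS shows 0
  kubectl get pdb -n kube-system
  # relaxing the budget unblocks the drain
  kubectl patch pdb <pdb-name> -n <namespace> --type merge -p '{"spec":{"maxUnavailable":1}}'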

paulgmiller avatar Dec 15 '21 21:12 paulgmiller