kured
Node stays on Ready,SchedulingDisabled
We are using Kured on AKS and I regularly see that nodes stay in status Ready,SchedulingDisabled and I have to uncordon them manually. When I look at the log of the kured pod it shows:
time="2019-03-06T06:30:27Z" level=info msg="Kubernetes Reboot Daemon: 1.1.0"
time="2019-03-06T06:30:27Z" level=info msg="Node ID: aks-default-13951270-0"
time="2019-03-06T06:30:27Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
time="2019-03-06T06:30:27Z" level=info msg="Reboot Sentinel: /var/run/reboot-required every 1h0m0s"
time="2019-03-06T06:30:28Z" level=info msg="Holding lock"
time="2019-03-06T06:30:28Z" level=info msg="Uncordoning node aks-default-13951270-0"
time="2019-03-06T06:30:29Z" level=info msg="node/aks-default-13951270-0 uncordoned" cmd=/usr/bin/kubectl std=out
time="2019-03-06T06:30:29Z" level=info msg="Releasing lock"
So it says it uncordoned the node, but I regularly see that nodes are in fact not uncordoned. Is this something you see as well?
I've encountered this twice this morning.
@bramvdklinkenberg interesting - thanks for the report. @sylr are you on AKS too?
Kured simply runs kubectl uncordon $node - you can see from the output above that the command indicated success:
Uncordoning node aks-default-13951270-0
node/aks-default-13951270-0 uncordoned
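Next time it happens, a quick way to double-check whether the node really did end up uncordoned despite that log line (node name taken from the log above - substitute your own) would be:

# Prints "true" while the node is still cordoned; empty once it has been uncordoned
kubectl get node aks-default-13951270-0 -o jsonpath='{.spec.unschedulable}'

# Or simply look for the status column
kubectl get nodes | grep SchedulingDisabled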
The only thing that comes to mind is that there's some incompatibility between the version of kubectl in the kured image you're using and AKS - next time this happens, it would be interesting if you could report:
- The AKS kubernetes server version
- The kured image version
- The version of kubectl that you subsequently used to uncordon successfully
(a quick way to gather these is sketched below)
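A minimal sketch for gathering those three pieces of information, assuming kured is deployed as a DaemonSet named kured in kube-system (adjust to your installation):

# Client (kubectl) and AKS server versions
kubectl version

# Image the kured DaemonSet is running
kubectl -n kube-system get ds kured -o jsonpath='{.spec.template.spec.containers[0].image}'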
Hi @awh, I am no longer working at the customer where I encountered the issue. Maybe @sylr can provide the requested info.
I've got the same issue using AKS and the versions are as follows.
- Kubernetes version 1.13.5
- Kured image: weaveworks/kured:1.1.0
- kubectl client: Client Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0" ...}
Had an issue just like this, versions following.
- AKS Kubernetes server version is 1.12.7
- The kured image version is weaveworks/kured:1.2.0
- The working local kubectl version is version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-06T01:44:30Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"windows/amd64"}
This problem occurs when there is only 1 node in AKS, because the pod cannot be re-created after the node has rebooted.
@hermanho so how do we re-create the pods after the node reboots?
Encountered this problem on AKS as well. Had to adjust my PodDisruptionBudget before the drain would complete.
Details:
Kubernetes version: Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.7", GitCommit:"6c143d35bb11d74970e7bc0b6c45b6bfdffc0bd4", GitTreeState:"clean", BuildDate:"2019-12-11T12:34:17Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
kured version: docker.io/weaveworks/kured:1.2.0
kubectl version: Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.1", GitCommit:"d224476cd0730baca2b6e357d144171ed74192d6", GitTreeState:"clean", BuildDate:"2020-01-15T15:50:38Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"darwin/amd64"}
Each time I hit this, it is because the reboot cannot occur: a Pod Disruption Budget is not allowing pods to be evicted from the node kured is trying to drain.
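For anyone wanting to check whether a budget is what is blocking the drain, a quick sketch using plain kubectl (nothing kured-specific):

# Budgets showing ALLOWED DISRUPTIONS = 0 for pods on the drained node are the usual culprit
kubectl get pdb --all-namespaces

# Inspect a suspect budget in detail (namespace and name are placeholders)
kubectl -n <namespace> describe pdb <name>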
@hermanho so how do we re-create the pods after the node reboots?
kubectl uncordon <your node name>
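If several nodes are stuck at once, a small convenience sketch that wraps the same command (standard kubectl plus awk/xargs, nothing kured-specific):

# Uncordon every node currently reporting SchedulingDisabled
kubectl get nodes --no-headers \
  | awk '/SchedulingDisabled/ {print $1}' \
  | xargs -r -n1 kubectl uncordon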
Hi guys, we just had this issue on one of our AKS clusters but we don't have any PDB configured...
- AKS Version: v1.15.7
- kured image: docker.io/weaveworks/kured:master-f6e4062
We also encountered this issue in one of our 3-node AKS clusters.
AKS Version: v1.14.7
kured image: docker.io/weaveworks/kured:1.3.0
We also observed another behaviour: although the rebooted machine was still in the Ready,SchedulingDisabled state, kured continued on to patch the next machine.
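One way to see whether kured still thinks it holds the reboot lock for a node is to inspect the lock annotation mentioned in the startup log further up (DaemonSet name and namespace assumed from that log line - adjust to your install):

# Shows the current holder of the kured reboot lock, if any
kubectl -n kube-system get ds kured -o jsonpath='{.metadata.annotations.weave\.works/kured-node-lock}'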
Hello, same for us.
AKS Version: v1.15.10
kured image: docker.io/weaveworks/kured:1.3.0
Same here. No PDB, kured 1.3.0, 2-node AKS cluster, Kubernetes version 1.16.8. One cluster isn't uncordoned after reboot.
Found a timeout error in the log:
error: unable to uncordon node ... Timeout: request did not complete within requested timeout 30s" cmd=/usr/bin/kubectl std=err
A lack of resources might have something to do with that.
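Given the 30s client timeout in that error, one thing worth trying manually is the same uncordon with a longer request timeout (a standard kubectl flag; the node name is a placeholder):

# Retry the uncordon with a more generous API request timeout
kubectl uncordon <your node name> --request-timeout=2m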
We also had this problem:
AKS Kubernetes: v1.15.10
kured image: docker.io/weaveworks/kured:1.3.0
This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).
Problem persists on AKS with K8S v1.16.15.
https://github.com/weaveworks/kured/pull/283
Still having this/a similar issue. If a reboot/drain hangs because of a PDB and you manually delete the pod, the reboot happens but the node is still unschedulable (SchedulingDisabled).
Details of the node:
Events:
  Type     Reason                   Age  From        Message
  Normal   Starting                 42s  kubelet     Starting kubelet.
  Normal   NodeAllocatableEnforced  42s  kubelet     Updated Node Allocatable limit across pods
  Normal   NodeHasSufficientMemory  42s  kubelet     Node xxx-xx00004c status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    42s  kubelet     Node xxx-xx00004c status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     42s  kubelet     Node xxx-xx00004c status is now: NodeHasSufficientPID
  Warning  Rebooted                 42s  kubelet     Node xxx-xx00004c has been rebooted, boot id: xxx-xxx-xxx-xxx-xxx
  Normal   NodeNotReady             42s  kubelet     Node xxx-xx00004c status is now: NodeNotReady
  Normal   NodeNotSchedulable       42s  kubelet     Node xxx-xx00004c status is now: NodeNotSchedulable
  Normal   Starting                 36s  kube-proxy  Starting kube-proxy.
  Normal   NodeReady                32s  kubelet     Node xxx-xx00004c status is now: NodeReady
Taints:        node.kubernetes.io/unschedulable:NoSchedule
Unschedulable: true
So in the end the node is still NotSchedulable.
We are still having this issue, any progress here? Thanks!
Issue still exists; it would be great if this could be fixed.
Running into the issue again at another customer, @awh.
Kubectl: v1.21.2
AKS: v1.19.11
Kured: 1.6.1
I think, on top of the versions used, it would be good to know a bit about the environment. What CLI arguments do you pass? Did you wait for the full period? What is the sentinel reporting?
Having the same issue on AKS:
AKS: v1.20.5
Kured: 1.7
Encountered the same issue.
AKS: 1.21.2
Kured: 1.8.1
Is anyone experiencing this who isn't using PodDisruptionBudget configurations?
Something like this would be in the kured logs:
"error when evicting pods/"coredns-58567c6d46-sdctb" -n "kube-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
Since @hermanho mentioned single-node clusters: kured won't work on a single-node AKS cluster because AKS sets a PDB on coredns.
There are other cases where PDBs will block too, like if one is configured to allow no disruptions (maxUnavailable=0 || minAvailable==replicas) or if pods are genuinely not coming back up to ready after eviction. But in those cases you would see the "error when evicting pods" message mentioned above in the kured logs.
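A quick sketch for spotting budgets in that state, assuming jq is installed (disruptionsAllowed is the standard PDB status field):

# List PDBs that currently allow zero disruptions - these will block kured's drain
kubectl get pdb --all-namespaces -o json \
  | jq -r '.items[] | select(.status.disruptionsAllowed == 0) | "\(.metadata.namespace)/\(.metadata.name)"'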
We're less clear on why this happens when it's not blocking on eviction/PDB.