Unable to complete scaling down for more than 12 hours
Which component are you using?: cluster-autoscaler
What version of the component are you using?: v1.26.2
What k8s version are you using (kubectl version)?: 1.27
What environment is this in?: AWS
What did you expect to happen?: It should scale down within a reasonable time.
What happened instead?:
I deleted a deployment, which triggered a scale down. From the cluster-autoscaler log, it correctly identified the event and initiated the scale down. However, as soon as it marked the node SchedulingDisabled, it created a new node. Sometimes it marks several nodes SchedulingDisabled, and then several new nodes get created. From the log I saw those new nodes were then considered unneeded, for example:
ip-xxx is unneeded since xxx duration xxxs
About 10 minutes later those new nodes were marked SchedulingDisabled, and then another batch of new nodes got created. This has lasted for more than 12 hours and is still ongoing. Sometimes I see pods being shifted around, but most of the time those new nodes were empty (no real pods except DaemonSets). I can't find anything in the log that explains why it keeps creating new nodes and looping.
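For anyone debugging a similar loop, here is a minimal sketch of how to observe it, assuming the deployment name (cluster-autoscaler) and the default status ConfigMap (cluster-autoscaler-status in kube-system) from the autodiscover example manifest; adjust names and namespaces to your setup:
```sh
# Node age and schedulability, to spot freshly created and cordoned nodes
kubectl get nodes --sort-by=.metadata.creationTimestamp \
  -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp,UNSCHEDULABLE:.spec.unschedulable

# The autoscaler's own view of its scale-up/scale-down state
# (default ConfigMap name and namespace; may differ in your deployment)
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml

# Follow the autoscaler log for scale-up / "unneeded" / scale-down decisions
kubectl -n kube-system logs deployment/cluster-autoscaler -f | grep -Ei 'unneeded|scale'
```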
How to reproduce it (as minimally and precisely as possible): It seems rather random. It had been working fine; this time, deleting a deployment triggered the loop. To replicate, delete a deployment and there is a chance it will run into the loop.
Anything else we need to know?: Any insight into why it keeps creating new nodes, and then marking them as unneeded, while it is shutting a node down?
This is the manifest I deployed: https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
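Not an answer, but one hedged hypothesis to check: the roughly 10-minute cadence described above matches the cluster-autoscaler defaults for --scale-down-unneeded-time and --scale-down-delay-after-add (both 10m, to my understanding). These are the flags on the container in that manifest I would inspect or experiment with; the values below are the documented defaults, not recommendations:
```sh
# cluster-autoscaler container args (defaults shown, per the upstream FAQ)
--scan-interval=10s                  # how often the cluster state is re-evaluated
--scale-down-unneeded-time=10m       # how long a node must be unneeded before it is removed
--scale-down-delay-after-add=10m     # pause on scale-down after a scale-up
--scale-down-delay-after-delete=10s  # pause after a node deletion (defaults to scan-interval)
--scale-down-delay-after-failure=3m  # pause after a failed scale-down
```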
To break out of the loop, I found a workaround: manually delete the new nodes it has created (the ones running only DaemonSets) before they are marked SchedulingDisabled.
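A minimal sketch of that workaround with kubectl, assuming NODE_NAME is a placeholder for an empty node you have already identified (DaemonSet pods only); note that deleting the Node object does not by itself terminate the backing EC2 instance:
```sh
# Confirm only DaemonSet pods are left on the node (NODE_NAME is a placeholder)
kubectl get pods --all-namespaces --field-selector spec.nodeName=NODE_NAME -o wide

# Remove the node before the autoscaler marks it SchedulingDisabled
kubectl cordon NODE_NAME
kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data
kubectl delete node NODE_NAME
```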
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
> This bot triages issues according to the following rules:
> - After 90d of inactivity, lifecycle/stale is applied
> - After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
> - After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
> You can:
> - Reopen this issue with /reopen
> - Mark this issue as fresh with /remove-lifecycle rotten
> - Offer to help out with Issue Triage
> Please send feedback to sig-contributor-experience at kubernetes/community.
> /close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.