
Unable to complete scaling down for more than 12 hours

lz000 opened this issue on Sep 22, 2023

Which component are you using?: cluster-autoscaler

What version of the component are you using?: v1.26.2

Component version:

What k8s version are you using (kubectl version)?: 1.27


What environment is this in?: aws

What did you expect to happen?: It should scale down in a reasonable amount of time.

What happened instead?: I deleted a deployment, which triggered a scale-down. From the cluster-autoscaler log, it correctly identified the event and initiated the scale-down. However, as soon as it marked the node SchedulingDisabled, it created a new node. Sometimes it marks several nodes SchedulingDisabled, and then several new nodes get created. In the log I saw those new nodes were considered unneeded, for example: ip-xxx is unneeded since xxx duration xxxs. Then, 10 minutes later, those new nodes were marked SchedulingDisabled and another set of new nodes got created. This has lasted for more than 12 hours and is still ongoing. Sometimes I see pods being shifted around, but most of the time those new nodes were empty (no real pods, only DaemonSets). I can't find anything in the log explaining why it keeps creating new nodes and looping.
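
For reference, this is roughly how I have been watching the loop. The label selector assumes the autodiscover manifest linked below, and cluster-autoscaler-status is the autoscaler's default status ConfigMap name; adjust for your setup:

  $ kubectl -n kube-system logs -l app=cluster-autoscaler --tail=200 -f       # the autoscaler's own view of its scale-up/scale-down decisions
  $ kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml    # per-node-group scale-up/scale-down status
  $ kubectl get nodes -w                                                      # watch nodes flip to SchedulingDisabled and new nodes appear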

How to reproduce it (as minimally and precisely as possible): It seems rather random. Everything was working fine; this particular time, deleting a deployment triggered the loop. To replicate, delete a deployment and there is a chance it will run into the same loop.
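
A minimal attempt at reproducing it might look like the following; the deployment name and image are placeholders picked for illustration, not something from my actual setup:

  $ kubectl create deployment ca-loop-test --image=registry.k8s.io/pause:3.9 --replicas=50
  # wait for the autoscaler to add nodes for the new pods, then remove the workload
  $ kubectl delete deployment ca-loop-test
  $ kubectl get nodes -w    # watch whether cordoned nodes keep being replaced by new, empty ones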

Anything else we need to know?: Any insight into why it keeps creating new nodes, and then marking them as unneeded, while it is shutting a node down?

This is the YAML I deployed: https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
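
For context, these are the upstream flags that, as far as I understand, govern the scale-down timing involved here. The values shown are only illustrative, not my actual settings, and the deployment name assumes the manifest above:

  $ kubectl -n kube-system get deployment cluster-autoscaler -o jsonpath='{.spec.template.spec.containers[0].command}'
  #   --scale-down-unneeded-time=10m        how long a node must be unneeded before it is eligible for removal
  #   --scale-down-delay-after-add=10m      how long scale-down is paused after a scale-up
  #   --scale-down-delay-after-delete=0s    how long scale-down is paused after a node deletion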

To break out of the loop, the workaround I found is to manually delete the new nodes it has created (the ones running only DaemonSets) before they are marked SchedulingDisabled, roughly like this:
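
The node name below is a placeholder in the same style as above; also note that deleting the Node object by itself does not terminate the underlying EC2 instance, so the instance/ASG may still need to be reconciled separately:

  $ kubectl get nodes --sort-by=.metadata.creationTimestamp    # the newest (empty) nodes are listed last
  $ kubectl delete node ip-xxx.ec2.internal                    # delete the fresh, DaemonSet-only nodes before they are cordoned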

lz000 · Sep 22, 2023

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Jan 30, 2024

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Feb 29, 2024

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot · Apr 20, 2024

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:


/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · Apr 20, 2024