cluster autoscaler should consider availability zone balancing during scale-down

Open • rittneje opened this issue 5 years ago • 37 comments

We are running a cluster in AWS EKS that uses nodes from auto-scaling groups. We have noticed that whenever the autoscaler terminates a node during scale-down, the auto-scaling group triggers an availability zone rebalancing shortly thereafter, which in turn leads to a spike in errors. It would be preferable for the cluster autoscaler to properly account for availability zones during scale-down, shuffling pods between nodes as necessary to preemptively avoid a rebalancing.

rittneje avatar Nov 14 '20 16:11 rittneje

We are also seeing this, even after removing --balance-similar-node-groups as suggested in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#common-notes-and-gotchas.

Maybe we'll give the Suspended Processes setting in the ASG console a try.
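
For reference, a minimal sketch of that workaround via the AWS CLI (the ASG name below is a placeholder). Suspending only the AZRebalance process stops the ASG from launching and terminating instances to even out zones on its own, at the cost of the group drifting out of zone balance:

    # "my-eks-node-group-asg" is a placeholder for your node group's ASG.
    # Suspends only AZ rebalancing; all other scaling processes keep running.
    aws autoscaling suspend-processes \
      --auto-scaling-group-name my-eks-node-group-asg \
      --scaling-processes AZRebalance

    # To restore the default behavior later:
    aws autoscaling resume-processes \
      --auto-scaling-group-name my-eks-node-group-asg \
      --scaling-processes AZRebalance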

knkarthik avatar Nov 23 '20 23:11 knkarthik

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar Feb 21 '21 23:02 fejta-bot

/remove-lifecycle stale

rittneje avatar Feb 21 '21 23:02 rittneje

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar May 22 '21 23:05 fejta-bot

/remove-lifecycle stale

rittneje avatar May 23 '21 00:05 rittneje

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 21 '21 00:08 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Aug 21 '21 00:08 rittneje

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 14 '21 16:12 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Dec 14 '21 18:12 rittneje

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Mar 14 '22 19:03 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Mar 14 '22 19:03 rittneje

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 12 '22 19:06 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Jun 12 '22 20:06 rittneje

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 10 '22 20:09 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Sep 11 '22 01:09 rittneje

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 10 '22 02:12 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Dec 10 '22 02:12 rittneje

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Mar 10 '23 03:03 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Mar 10 '23 03:03 rittneje

Hi all, is there any news in the meantime? I think we're experiencing this issue: AWS Auto Scaling rebalances and consequently scales down the resulting overprovisioning, independently of the cluster-autoscaler's work:

MidTerminatingLifecycleAction
	Terminating EC2 instance: i-XYZ	At 2023-05-08T12:08:31Z instances were launched to balance instances in zones eu-west-1b eu-west-1a with other zones resulting in more than desired number of instances in the group.
	At 2023-05-08T12:08:42Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 12 to 11.
	At 2023-05-08T12:08:42Z instance i-XYZ was selected for termination.

The result is that AWS Auto Scaling spawns new instances to balance availability across the AZs, but since that exceeds the desired instance count, it then terminates an instance to match the desired count (which is managed by the cluster-autoscaler), obviously bypassing the cluster-autoscaler.
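
As a side note, activity history like the quote above can be pulled with the AWS CLI (the ASG name is a placeholder):

    # Placeholder ASG name. Lists recent scaling activities, including
    # the AZ-rebalancing launches and terminations quoted above.
    aws autoscaling describe-scaling-activities \
      --auto-scaling-group-name my-eks-node-group-asg \
      --max-items 10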

Furthermore, I have workloads that can't be evicted (for which I set cluster-autoscaler.kubernetes.io/safe-to-evict: "false"), and AWS Auto Scaling is obviously agnostic to that.
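
For reference, a minimal sketch of setting that annotation (the pod name is a placeholder). Note that it only instructs the cluster autoscaler; the ASG's AZRebalance process never consults it, which is exactly the gap described here:

    # "my-critical-pod" is a placeholder. This tells the cluster autoscaler
    # not to evict the pod during scale-down; AWS AZ rebalancing ignores it.
    kubectl annotate pod my-critical-pod \
      cluster-autoscaler.kubernetes.io/safe-to-evict="false"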

Am I missing something?

maxgio92 avatar May 08 '23 17:05 maxgio92

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 19 '24 21:01 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Jan 19 '24 23:01 rittneje

Pinging repo approvers about this feature request. @mwielgus @MaciekPytel @gjtempleton

This seems to be a valid issue. It is documented in this repo's documentation here:

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#im-running-cluster-with-nodes-in-multiple-zones-for-ha-purposes-is-that-supported-by-cluster-autoscaler

Currently the balancing is only done at scale-up. Cluster Autoscaler will still scale down underutilized nodes regardless of the relative sizes of underlying node groups. We plan to take balancing into account in scale-down in the future.
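
For context, a rough sketch of the scale-up-only balancing that exists today (the ASG names and size ranges below are placeholders). With one ASG per AZ, --balance-similar-node-groups spreads scale-up across zones, but scale-down still removes underutilized nodes regardless of zone:

    # Placeholder ASG names and min:max sizes, one node group per AZ.
    ./cluster-autoscaler \
      --cloud-provider=aws \
      --balance-similar-node-groups=true \
      --nodes=1:10:my-asg-eu-west-1a \
      --nodes=1:10:my-asg-eu-west-1b \
      --nodes=1:10:my-asg-eu-west-1c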

Is the sentence "We plan to take balancing into account in scale-down in the future" still valid?

Is there a roadmap published on GitHub?

Why has this been blocked for such a long time? Is there a lack of interest in doing the implementation, or does it require a massive code refactoring that is not worth the effort?

Please let the community know what would help here. Thank you!

zioproto avatar Mar 05 '24 08:03 zioproto

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 19 '24 13:06 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Jun 19 '24 16:06 rittneje

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 17 '24 16:09 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Sep 17 '24 18:09 rittneje

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 16 '24 18:12 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Dec 16 '24 19:12 rittneje

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Mar 16 '25 20:03 k8s-triage-robot