cluster autoscaler should consider availability zone balancing during scale-down

Open • rittneje opened this issue 5 years ago • 37 comments

We are running a cluster in AWS EKS that uses nodes from auto-scaling groups. We have noticed that whenever the autoscaler terminates a node during scale-down, the auto-scaling group triggers an availability zone rebalancing shortly thereafter, which in turn leads to a spike in errors. It would be preferable for the cluster autoscaler to properly account for availability zones during scale-down, shuffling pods between nodes as necessary to preemptively avoid a rebalancing.

rittneje avatar Nov 14 '20 16:11 rittneje

We are also seeing this, even after removing --balance-similar-node-groups as suggested in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#common-notes-and-gotchas.

Maybe we'll give the Suspended Processes setting in the ASG console a try.
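
For reference, a minimal sketch of that workaround via the AWS CLI (the ASG name below is a placeholder). Suspending only the AZRebalance process stops the ASG from launching and terminating instances to even out zones on its own, at the cost of the group drifting out of zone balance:

    # "my-eks-node-group-asg" is a placeholder for your node group's ASG.
    # Suspends only AZ rebalancing; all other scaling processes keep running.
    aws autoscaling suspend-processes \
      --auto-scaling-group-name my-eks-node-group-asg \
      --scaling-processes AZRebalance

    # To restore the default behavior later:
    aws autoscaling resume-processes \
      --auto-scaling-group-name my-eks-node-group-asg \
      --scaling-processes AZRebalance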

knkarthik avatar Nov 23 '20 23:11 knkarthik

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar Feb 21 '21 23:02 fejta-bot

/remove-lifecycle stale

rittneje avatar Feb 21 '21 23:02 rittneje

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar May 22 '21 23:05 fejta-bot

/remove-lifecycle stale

rittneje avatar May 23 '21 00:05 rittneje

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 21 '21 00:08 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Aug 21 '21 00:08 rittneje

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 14 '21 16:12 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Dec 14 '21 18:12 rittneje

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Mar 14 '22 19:03 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Mar 14 '22 19:03 rittneje

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 12 '22 19:06 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Jun 12 '22 20:06 rittneje

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 10 '22 20:09 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Sep 11 '22 01:09 rittneje

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 10 '22 02:12 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Dec 10 '22 02:12 rittneje

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Mar 10 '23 03:03 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Mar 10 '23 03:03 rittneje

Hi all, is there any news in the meantime? I think we're experiencing this issue: AWS Auto Scaling rebalances and consequently scales down the resulting overprovisioning, independently of the cluster-autoscaler's work:

MidTerminatingLifecycleAction
	Terminating EC2 instance: i-XYZ	At 2023-05-08T12:08:31Z instances were launched to balance instances in zones eu-west-1b eu-west-1a with other zones resulting in more than desired number of instances in the group.
	At 2023-05-08T12:08:42Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 12 to 11.
	At 2023-05-08T12:08:42Z instance i-XYZ was selected for termination.

The result is that AWS Auto Scaling spawns new instances to balance availability across the AZs, but since that exceeds the desired instance count, it then terminates an instance to match the desired count (which is managed by the cluster-autoscaler), obviously bypassing the cluster-autoscaler.
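
As a side note, activity history like the quote above can be pulled with the AWS CLI (the ASG name is a placeholder):

    # Placeholder ASG name. Lists recent scaling activities, including
    # the AZ-rebalancing launches and terminations quoted above.
    aws autoscaling describe-scaling-activities \
      --auto-scaling-group-name my-eks-node-group-asg \
      --max-items 10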

Furthermore, I have workloads that can't be evicted (for which I set cluster-autoscaler.kubernetes.io/safe-to-evict: "false"), and AWS Auto Scaling is obviously agnostic to that.
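
For reference, a minimal sketch of setting that annotation (the pod name is a placeholder). Note that it only instructs the cluster autoscaler; the ASG's AZRebalance process never consults it, which is exactly the gap described here:

    # "my-critical-pod" is a placeholder. This tells the cluster autoscaler
    # not to evict the pod during scale-down; AWS AZ rebalancing ignores it.
    kubectl annotate pod my-critical-pod \
      cluster-autoscaler.kubernetes.io/safe-to-evict="false"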

Am I missing something?

maxgio92 avatar May 08 '23 17:05 maxgio92

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 19 '24 21:01 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Jan 19 '24 23:01 rittneje

Pinging repo approvers about this feature request. @mwielgus @MaciekPytel @gjtempleton

This seems to be a valid issue. It is documented in this repo's documentation here:

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#im-running-cluster-with-nodes-in-multiple-zones-for-ha-purposes-is-that-supported-by-cluster-autoscaler

Currently the balancing is only done at scale-up. Cluster Autoscaler will still scale down underutilized nodes regardless of the relative sizes of underlying node groups. We plan to take balancing into account in scale-down in the future.
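
For context, a rough sketch of the scale-up-only balancing that exists today (the ASG names and size ranges below are placeholders). With one ASG per AZ, --balance-similar-node-groups spreads scale-up across zones, but scale-down still removes underutilized nodes regardless of zone:

    # Placeholder ASG names and min:max sizes, one node group per AZ.
    ./cluster-autoscaler \
      --cloud-provider=aws \
      --balance-similar-node-groups=true \
      --nodes=1:10:my-asg-eu-west-1a \
      --nodes=1:10:my-asg-eu-west-1b \
      --nodes=1:10:my-asg-eu-west-1c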

Is the sentence "We plan to take balancing into account in scale-down in the future" still valid?

Is there a roadmap published on GitHub?

Why has this been blocked for such a long time? Is there a lack of interest in doing the implementation, or does it require a massive code refactoring that is not worth the effort?

Please let the community know what would help here. Thank you!

zioproto avatar Mar 05 '24 08:03 zioproto

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 19 '24 13:06 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Jun 19 '24 16:06 rittneje

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 17 '24 16:09 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Sep 17 '24 18:09 rittneje

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 16 '24 18:12 k8s-triage-robot

/remove-lifecycle stale

rittneje avatar Dec 16 '24 19:12 rittneje

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Mar 16 '25 20:03 k8s-triage-robot