
Consider nodes for scale-down in an order which avoids unnecessary pod disruptions

Open losipiuk opened this issue 5 years ago • 42 comments

Currently, the scale-down logic does little (if anything) to prevent pods from being disrupted multiple times as nodes are removed during scale-down. Giving scale-down preference to the least utilized nodes seems like a better approach. The current code gives equal preference to any node whose utilization is below the threshold (50% by default).
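
To make the intent concrete, a minimal sketch of the kind of ordering I have in mind (the `candidate` type and `orderForScaleDown` function here are illustrative only, not the autoscaler's actual code):

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a simplified, hypothetical view of a scale-down candidate;
// the real autoscaler works with much richer node/pod information.
type candidate struct {
	name        string
	utilization float64 // fraction of allocatable resources requested, 0.0-1.0
}

// orderForScaleDown sorts candidates so the least utilized nodes are
// considered for removal first, instead of in effectively random order.
func orderForScaleDown(candidates []candidate) {
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].utilization < candidates[j].utilization
	})
}

func main() {
	nodes := []candidate{
		{"node-a", 0.45},
		{"node-b", 0.10},
		{"node-c", 0.30},
	}
	orderForScaleDown(nodes)
	fmt.Println(nodes) // least utilized first: node-b, node-c, node-a
}
```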

losipiuk avatar Jul 17 '18 15:07 losipiuk

Do we know how this will interact with the default scheduler's scoring function? If draining the least utilized node first pushes the 40-50% utilized nodes above the threshold, while draining one of those nodes first would not have pushed the least utilized node above the threshold, then least-utilized-first would mean removing fewer nodes overall.

aleksandra-malinowska avatar Jul 18 '18 09:07 aleksandra-malinowska

Sure. The final shape of the cluster certainly depends on the order in which we remove nodes, and the case you pointed out is definitely possible. So we need to think about it a bit and settle on an ordering we feel is reasonable. Currently we have a random one, which IMO is one of the worst possible options:

  1. It is non-deterministic.
  2. The choice made may be really poor, since we are not even trying to mitigate that.

losipiuk avatar Jul 20 '18 11:07 losipiuk

Given that the autoscaler has no way of knowing the expected lifetime of a pod without being hinted, for instance via annotations, I think it's really hard to get this right in all cases.

To keep the code from growing to accommodate every possible/likely scale-down scenario, I'd like to propose two scale-down options (rough sketch after the list):

  1. Remove the node whose removal results in the fewest pod migrations. The downside is that the biggest pods will likely end up being migrated the most.
  2. Remove the newest node. It is most likely to host the youngest workloads, as in my experience older nodes gradually fill up with stable/long-lasting workloads.
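
A rough sketch of both options, with purely illustrative types rather than the autoscaler's real node/pod plumbing:

```go
package main

import (
	"fmt"
	"time"
)

// node is an illustrative stand-in for a scale-down candidate; a real
// implementation would derive this from the Node object and pod listers.
type node struct {
	name       string
	created    time.Time
	podsToMove int // pods that would have to be migrated if this node is drained
}

// fewestMigrations implements option 1: pick the node whose removal
// migrates the fewest pods.
func fewestMigrations(nodes []node) node {
	best := nodes[0]
	for _, n := range nodes[1:] {
		if n.podsToMove < best.podsToMove {
			best = n
		}
	}
	return best
}

// newestNode implements option 2: pick the most recently created node,
// which is most likely to host the youngest workloads.
func newestNode(nodes []node) node {
	best := nodes[0]
	for _, n := range nodes[1:] {
		if n.created.After(best.created) {
			best = n
		}
	}
	return best
}

func main() {
	now := time.Now()
	nodes := []node{
		{"node-a", now.Add(-48 * time.Hour), 7},
		{"node-b", now.Add(-2 * time.Hour), 3},
		{"node-c", now.Add(-30 * 24 * time.Hour), 1},
	}
	fmt.Println(fewestMigrations(nodes).name) // node-c
	fmt.Println(newestNode(nodes).name)       // node-b
}
```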

WebSpider avatar Sep 14 '18 01:09 WebSpider

I think @WebSpider's idea makes sense, though I'd like to include pod priority in it. Maybe the first point should be 'fewest migrations of pods with the highest priority' instead. If the user doesn't use priorities, all pods will have the same priority, so it will behave the same as what @WebSpider proposed.

MaciekPytel avatar Sep 14 '18 09:09 MaciekPytel

So, when taking pod priority into account and choosing 'fewest migrations of pods with the highest priority', I think this would best be calculated by:

  1. Ignoring any pods with negative priorities; they indicate an expendable pod anyway, IMO.
  2. That leaves two options: either we calculate the sum of the positive priorities of the pods that would have to migrate, or we calculate their average priority. The rationale for the latter is that we really don't want to touch higher-priority pods.

So, using the average priority, a node with 50 prio-1 pods would be evicted to save 1 pod with prio 25. Using the sum instead, the 1 prio-25 pod would get evicted to save the 50 prio-1 pods.

I tend to prefer saving the prio-25 pod and evicting the 25+ pods, but the choice seems arbitrary, really.
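
A small sketch that just reproduces the arithmetic of the example above (the scoring helpers are hypothetical; lower score = better scale-down candidate):

```go
package main

import "fmt"

// sumPriority and avgPriority are two hypothetical scoring functions for the
// pods that would have to migrate off a node; negative priorities are
// skipped, per point 1 above. Lower score = better scale-down candidate.

func sumPriority(prios []int32) float64 {
	var sum float64
	for _, p := range prios {
		if p > 0 {
			sum += float64(p)
		}
	}
	return sum
}

func avgPriority(prios []int32) float64 {
	var sum, n float64
	for _, p := range prios {
		if p > 0 {
			sum += float64(p)
			n++
		}
	}
	if n == 0 {
		return 0
	}
	return sum / n
}

func main() {
	// The example above: node A runs 50 pods at priority 1,
	// node B runs a single pod at priority 25.
	nodeA := make([]int32, 50)
	for i := range nodeA {
		nodeA[i] = 1
	}
	nodeB := []int32{25}

	fmt.Println(sumPriority(nodeA), sumPriority(nodeB)) // 50 25 -> the prio-25 pod gets evicted
	fmt.Println(avgPriority(nodeA), avgPriority(nodeB)) // 1 25  -> the 50 prio-1 pods get evicted
}
```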

WebSpider avatar Sep 14 '18 12:09 WebSpider

I think any function calculating a numerical value like the average is arbitrary. I would literally just compare the number of pods with the highest priority (i.e. ignore everything else). If two nodes have the same number of pods at the highest priority, look at the second-highest priority, and so on (lexicographic comparison).
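
A sketch of what that lexicographic comparison could look like (the `priorityHistogram` type and `lessDisruptive` function are made up for illustration):

```go
package main

import (
	"fmt"
	"sort"
)

// priorityHistogram counts pods per priority on a scale-down candidate.
// The type is illustrative, not taken from the autoscaler codebase.
type priorityHistogram map[int32]int

// lessDisruptive reports whether draining node a disrupts fewer high-priority
// pods than draining node b, comparing counts priority by priority from the
// highest priority downwards (lexicographic comparison).
func lessDisruptive(a, b priorityHistogram) bool {
	// Collect the union of priorities present on either node.
	prios := map[int32]struct{}{}
	for p := range a {
		prios[p] = struct{}{}
	}
	for p := range b {
		prios[p] = struct{}{}
	}
	sorted := make([]int32, 0, len(prios))
	for p := range prios {
		sorted = append(sorted, p)
	}
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })

	// First differing count at the highest priority decides.
	for _, p := range sorted {
		if a[p] != b[p] {
			return a[p] < b[p]
		}
	}
	return false // identical histograms
}

func main() {
	nodeA := priorityHistogram{100: 1, 0: 20} // one high-priority pod
	nodeB := priorityHistogram{0: 50}         // many low-priority pods
	fmt.Println(lessDisruptive(nodeB, nodeA)) // true: prefer draining nodeB
}
```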

MaciekPytel avatar Sep 14 '18 13:09 MaciekPytel

IMO any (or almost any) deterministic routine is better than the random behavior we have right now.

One thing worth noting is that the CA scale-down algorithm currently considers only a subset of the potential scale-down candidates when the list is long enough. Changing that would be hard without hurting scalability. So even with prioritized candidates, there is still an element of non-determinism whenever the actual list of scale-down candidates is longer than the considered one.

losipiuk avatar Sep 14 '18 18:09 losipiuk

IMO any (or almost any) deterministic routine is better than the random behavior we have right now.

Not necessarily - selecting at random avoids cases where we get stuck repeatedly trying something that doesn't work. A lot of the issues with Cluster Autoscaler we've encountered so far were related to such behaviors (trying to remove the same unregistered node every loop and failing, trying the same scale-up option every loop and failing, etc.). This doesn't mean we can't improve it, just that it's not as easy as defining a priority function. Back-off on draining a node will help with this, but I'm not 100% certain it'll be enough.

aleksandra-malinowska avatar Sep 19 '18 12:09 aleksandra-malinowska

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Dec 18 '18 12:12 fejta-bot

/remove-lifecycle stale

losipiuk avatar Dec 18 '18 13:12 losipiuk

Given that the autoscaler has no way of knowing the expected lifetime of a pod without being hinted, for instance via annotations, I think it's really hard to get this right in all cases.

To keep the code from growing to accommodate every possible/likely scale-down scenario, I'd like to propose two scale-down options:

  1. Remove the node whose removal results in the fewest pod migrations. The downside is that the biggest pods will likely end up being migrated the most.
  2. Remove the newest node. It is most likely to host the youngest workloads, as in my experience older nodes gradually fill up with stable/long-lasting workloads.

I'd personally vote for a 3rd option: remove the oldest node during scale-in, as a way to cycle in new/updated nodes with security patches. For example: update a launch config with the latest AMI version, scale the cluster to 2x its current node size, and let it organically scale in, dropping the older/vulnerable/unpatched nodes.
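
A rough sketch of what an oldest-node-first ordering could look like (illustrative types only, not a concrete proposal for the autoscaler's internals):

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// node is an illustrative candidate; a real implementation would read the
// creation timestamp from the Node object's metadata.
type node struct {
	name    string
	created time.Time
}

// oldestFirst orders scale-down candidates so that the oldest nodes
// (e.g. those still running an outdated AMI) are drained first.
func oldestFirst(candidates []node) []node {
	out := append([]node(nil), candidates...)
	sort.Slice(out, func(i, j int) bool {
		return out[i].created.Before(out[j].created)
	})
	return out
}

func main() {
	now := time.Now()
	nodes := []node{
		{"node-new", now.Add(-1 * time.Hour)},
		{"node-old", now.Add(-40 * 24 * time.Hour)},
		{"node-mid", now.Add(-5 * 24 * time.Hour)},
	}
	for _, n := range oldestFirst(nodes) {
		fmt.Println(n.name) // node-old, node-mid, node-new
	}
}
```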

n2aws avatar Jan 25 '19 23:01 n2aws

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Apr 29 '19 18:04 fejta-bot

/remove-lifecycle stale

This feature would help me a lot, since my cloud provider always has problems with "old" VMs. +1 for oldest-node removal as suggested by @n2aws (or make it possible to configure which strategy to use).

wilsonhipolito avatar Jun 20 '19 16:06 wilsonhipolito

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Sep 18 '19 16:09 fejta-bot

/remove-lifecycle stale

losipiuk avatar Sep 20 '19 06:09 losipiuk

+1 for being able to configure the autoscaler to remove the oldest nodes first. This is a feature of AWS ASGs (the "OldestInstance" termination policy), but the autoscaler ignores it (rightly) and scales in only underutilized nodes, which in some cases we actually want to keep online because they were brought up to "cycle" the ASG.

emalihin avatar Sep 23 '19 10:09 emalihin

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Dec 22 '19 10:12 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot avatar Jan 21 '20 11:01 fejta-bot

/remove-lifecycle rotten

emalihin avatar Jan 21 '20 12:01 emalihin

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Apr 20 '20 13:04 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot avatar May 20 '20 14:05 fejta-bot

/remove-lifecycle rotten

n2aws avatar May 20 '20 17:05 n2aws

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Aug 18 '20 18:08 fejta-bot

/remove-lifecycle stale

Also, it's insanely frustrating that this bot marks issues as stale. Is there a flag you can add to stop the "auto-stale" action on a legitimate ticket that simply isn't being worked on at the moment?

n2aws avatar Aug 18 '20 18:08 n2aws

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Nov 16 '20 19:11 fejta-bot

/remove-lifecycle stale

n2aws avatar Nov 16 '20 19:11 n2aws

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar Feb 14 '21 20:02 fejta-bot

/remove-lifecycle stale

fliphess avatar Feb 15 '21 09:02 fliphess

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar May 16 '21 09:05 fejta-bot

/remove-lifecycle stale

fliphess avatar May 16 '21 12:05 fliphess