Consider nodes for scale down in an order which avoids unnecessary pod disruptions
Currently the scale-down logic does little (if anything) to prevent pods from being disrupted multiple times as nodes are removed during the scale-down process. It seems that giving scale-down preference to nodes with lower utilization would be better. The current code gives the same preference to any node whose utilization is below the threshold (default 50%).
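For illustration, a minimal sketch of that ordering, assuming a hypothetical candidate type (this is not the actual Cluster Autoscaler code):

```go
// Minimal sketch: prefer the least utilized node among those below the
// threshold. The candidate struct and its fields are made up for illustration.
package main

import (
	"fmt"
	"sort"
)

type candidate struct {
	name        string
	utilization float64 // requested / allocatable, e.g. 0.35 for 35%
}

// orderByUtilization keeps only nodes below the threshold (default 0.5)
// and sorts them so the least utilized node is considered first.
func orderByUtilization(nodes []candidate, threshold float64) []candidate {
	var eligible []candidate
	for _, n := range nodes {
		if n.utilization < threshold {
			eligible = append(eligible, n)
		}
	}
	sort.Slice(eligible, func(i, j int) bool {
		return eligible[i].utilization < eligible[j].utilization
	})
	return eligible
}

func main() {
	nodes := []candidate{{"a", 0.45}, {"b", 0.10}, {"c", 0.70}}
	fmt.Println(orderByUtilization(nodes, 0.5)) // [{b 0.1} {a 0.45}]
}
```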
Do we know how this would interact with the default scheduler's behaviour? If draining the least utilized node first pushes the 40-50% utilized nodes above the threshold, while draining one of those nodes first wouldn't have pushed the least utilized one above the threshold, then starting with the least utilized node would mean removing fewer nodes overall.
Sure. The final shape of the cluster certainly depends on the order in which we remove nodes, and the case you pointed out is definitely possible. So we need to think a bit about it and about an ordering which we feel is reasonable. Currently we have a random one, which IMO is one of the worst possible options:
- It is nondeterministic.
- The choice made may be really poor, as we are not even trying to mitigate that.
Given that the autoscaler can't know the expected lifetime of a pod unless it is hinted, for instance via annotations, I think it's really hard to do this right in all cases.
In order to keep the code from growing to accommodate every possible/likely scale-down scenario, I'd like to propose two scale-down options (sketched below):
- Remove the node which results in the fewest pod migrations. The downside is that the biggest pods will likely see the most migrations.
- Remove the newest node. This node is most likely to host the youngest workloads, as in my experience older nodes gradually fill up with stable, long-lasting workloads.
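A rough sketch of both orderings, assuming a hypothetical candidate type with a pod-migration count and a node creation timestamp (neither field is a real autoscaler API):

```go
// Hypothetical sketch of the two proposed orderings; the candidate
// struct and its fields are assumptions for illustration only.
package main

import (
	"fmt"
	"sort"
	"time"
)

type candidate struct {
	name          string
	podsToMigrate int       // pods that would need to be rescheduled
	created       time.Time // node creation timestamp
}

// fewestMigrationsFirst prefers removing the node whose removal
// reschedules the fewest pods.
func fewestMigrationsFirst(nodes []candidate) {
	sort.Slice(nodes, func(i, j int) bool {
		return nodes[i].podsToMigrate < nodes[j].podsToMigrate
	})
}

// newestFirst prefers removing the most recently created node, on the
// assumption that it carries the youngest, least stable workloads.
func newestFirst(nodes []candidate) {
	sort.Slice(nodes, func(i, j int) bool {
		return nodes[i].created.After(nodes[j].created)
	})
}

func main() {
	nodes := []candidate{
		{"old", 12, time.Now().Add(-48 * time.Hour)},
		{"new", 3, time.Now().Add(-1 * time.Hour)},
	}
	newestFirst(nodes)
	fmt.Println(nodes[0].name) // "new" would be removed first
}
```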
I think @WebSpider's idea makes sense, though I'd like to include pod priority in it. Maybe the first point should be 'fewest migrations of pods with highest priority' instead. If the user doesn't use priorities, all pods will have the same priority, so it will be the same as what @WebSpider proposed.
So, when taking pod priority into account and choosing 'fewest migrations of pods with highest priority', I think this would best be calculated by:
- Not taking into consideration any pods with negative priorities. They indicate an expendable pod anyway, IMO.
- Then there are two options left: we either calculate the sum of the positive priorities of the pods that would have to migrate, or we calculate their average priority. The rationale in the latter case would be that we really don't want to touch higher-priority pods.
So, when taking the average priority into account, a node with 50 prio-1 pods would be evicted to save 1 pod with prio 25. When not taking it into account, the 1 prio-25 pod would get evicted to save the 50 prio-1 pods.
I tend to prefer saving the prio-25 pod and evicting 25+ pods, but the choice seems arbitrary, really.
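To make the difference concrete, a tiny worked example with the hypothetical numbers above:

```go
// Worked example of the two metrics discussed above (hypothetical numbers):
// node A has 50 pods at priority 1, node B has 1 pod at priority 25.
package main

import "fmt"

func main() {
	// Sum of positive priorities of the pods that would have to migrate:
	// node B is "cheaper" to drain, so the prio-25 pod gets evicted.
	sumA, sumB := 50*1, 1*25 // 50 vs 25

	// Average priority of the pods that would have to migrate:
	// node A is "cheaper" to drain, so the 50 prio-1 pods get evicted.
	avgA, avgB := float64(50*1)/50, float64(1*25)/1 // 1.0 vs 25.0

	fmt.Println(sumA, sumB, avgA, avgB)
}
```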
I think any function calculating a numerical value like an average is arbitrary. I would literally just compare the number of pods with the highest priority (i.e. ignore everything else). If two nodes have the same number of pods with the highest priority, look at the second-highest priority, and so on (lexicographical comparison).
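A minimal sketch of that lexicographical comparison, assuming a hypothetical per-node histogram of pod priorities (not an existing CA type):

```go
// Compare two scale-down candidates by how many pods they would disrupt
// at each priority level, highest priority first.
package main

import (
	"fmt"
	"sort"
)

// priorityHistogram maps pod priority -> number of pods at that priority
// that would have to be migrated if the node were removed.
type priorityHistogram map[int32]int

// lessDisruptive reports whether removing node a disrupts fewer
// high-priority pods than removing node b (lexicographical comparison).
func lessDisruptive(a, b priorityHistogram) bool {
	seen := map[int32]bool{}
	for p := range a {
		seen[p] = true
	}
	for p := range b {
		seen[p] = true
	}
	prios := make([]int32, 0, len(seen))
	for p := range seen {
		prios = append(prios, p)
	}
	// Highest priority first.
	sort.Slice(prios, func(i, j int) bool { return prios[i] > prios[j] })

	for _, p := range prios {
		if a[p] != b[p] {
			return a[p] < b[p] // fewer pods at this priority wins
		}
	}
	return false // equally disruptive
}

func main() {
	nodeA := priorityHistogram{25: 1} // one prio-25 pod
	nodeB := priorityHistogram{1: 50} // fifty prio-1 pods
	// Prefer removing nodeB: it has zero pods at the highest priority (25).
	fmt.Println(lessDisruptive(nodeB, nodeA)) // true
}
```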
IMO any (or almost any) deterministic routine is better than the random behavior we have right now.
One thing worth noting is that the CA scale-down algorithm currently considers only a subset of the potential scale-down candidates once the list gets long enough. Changing that would be hard without hurting scalability. So even with prioritized candidates there would still be an element of non-determinism whenever the actual list of scale-down candidates is longer than the considered one.
IMO any (or almost any) deterministic routine is better than the random behavior we have right now.
Not necessarily - selecting at random avoids cases where we get stuck trying to do something that doesn't work. A lot of the issues with Cluster Autoscaler we've encountered so far were related to such behaviors (trying to remove the same unregistered node every loop and failing, trying the same scale-up option every loop and failing, etc.). This doesn't mean we can't improve it, just that it's not as easy as defining a priority function. Back-off on draining a node will help with this, but I'm not 100% certain it'll be enough.
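A hypothetical sketch of what such a per-node drain back-off could look like (none of these names are existing CA APIs):

```go
// After a failed drain, skip the node as a scale-down candidate for an
// increasing interval, so a deterministic ordering cannot get stuck
// retrying the same failing node every loop.
package main

import (
	"fmt"
	"time"
)

type drainBackoff struct {
	until map[string]time.Time
	delay map[string]time.Duration
}

func newDrainBackoff() *drainBackoff {
	return &drainBackoff{
		until: map[string]time.Time{},
		delay: map[string]time.Duration{},
	}
}

// RecordFailure doubles the node's back-off delay, capped at 30 minutes.
func (b *drainBackoff) RecordFailure(node string, now time.Time) {
	d := b.delay[node]
	if d == 0 {
		d = time.Minute
	} else {
		d *= 2
		if d > 30*time.Minute {
			d = 30 * time.Minute
		}
	}
	b.delay[node] = d
	b.until[node] = now.Add(d)
}

// Eligible reports whether the node may be considered for draining again.
func (b *drainBackoff) Eligible(node string, now time.Time) bool {
	return !now.Before(b.until[node])
}

func main() {
	b := newDrainBackoff()
	now := time.Now()
	b.RecordFailure("node-1", now)
	fmt.Println(b.Eligible("node-1", now))                    // false
	fmt.Println(b.Eligible("node-1", now.Add(2*time.Minute))) // true
}
```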
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
I'd personally vote for a 3rd option, in addition to the two proposed above. In an effort to cycle in new/updated nodes with security patches, it'd also be nice if we could kill the oldest node during scale-in. I.e., update a launch config with the latest AMI version, scale the cluster to 2x its current node size, then let it organically scale in, dropping the older/vulnerable/unpatched nodes.
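A hypothetical sketch of making the ordering configurable so that oldest-first is one selectable strategy (the strategy names and the candidate type are made up for illustration, not real Cluster Autoscaler flags or APIs):

```go
// Sort scale-down candidates according to a chosen strategy before the
// drain simulation considers them.
package main

import (
	"fmt"
	"sort"
	"time"
)

type candidate struct {
	name        string
	utilization float64
	created     time.Time
}

func orderCandidates(nodes []candidate, strategy string) error {
	switch strategy {
	case "oldest-first": // cycle out old nodes, e.g. after an AMI update
		sort.Slice(nodes, func(i, j int) bool {
			return nodes[i].created.Before(nodes[j].created)
		})
	case "newest-first":
		sort.Slice(nodes, func(i, j int) bool {
			return nodes[i].created.After(nodes[j].created)
		})
	case "least-utilized-first":
		sort.Slice(nodes, func(i, j int) bool {
			return nodes[i].utilization < nodes[j].utilization
		})
	default:
		return fmt.Errorf("unknown scale-down ordering strategy %q", strategy)
	}
	return nil
}

func main() {
	nodes := []candidate{
		{"n1", 0.4, time.Now().Add(-72 * time.Hour)},
		{"n2", 0.2, time.Now().Add(-2 * time.Hour)},
	}
	if err := orderCandidates(nodes, "oldest-first"); err == nil {
		fmt.Println(nodes[0].name) // "n1", the oldest node goes first
	}
}
```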
This feature would help me a lot, since my cloud provider always has problems with "old" VMs. +1 for the oldest-node removal suggested by @n2aws (or make it possible to configure which strategy to use).
+1 for being able to configure the autoscaler to remove the oldest nodes first - this is a feature of AWS ASGs (the "OldestInstance" termination policy), but the autoscaler ignores it (rightly) and scales in only underutilized nodes, which in some cases we actually want to keep online because they were brought up to "cycle" the ASG.
/remove-lifecycle stale
Also, it's insanely frustrating that this bot marks issues as stale. Is there a flag you can add to stop the "auto-stale" action on a legitimate ticket that simply isn't being worked on currently?