
add option to keep node group backoff on OutOfResource error


What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR introduces a new flag, `node-group-keep-backoff-out-of-resources`, which gives users control over the cluster autoscaler's (CA) backoff strategy when a node group scale-up fails due to an out-of-resources error from the cloud provider.

With this flag enabled, CA respects the entire backoff period for a node group that hit an out-of-resources error during scale-up. This differs from the current default behavior, which can silently remove the backoff before it expires, leading CA to repeatedly attempt to scale up an exhausted node group. That can significantly delay scale-up, especially for large scale-ups. The change is useful for clusters whose node groups use scarce instance types or whose cloud provider has long wait times for provisioning additional capacity.

By configuring `node-group-keep-backoff-out-of-resources` in tandem with `initial-node-group-backoff-duration` and `max-node-group-backoff-duration`, users can tune the scale-up strategy to minimize failed attempts and delays, letting CA try other node groups while the cloud provider provisions additional capacity.
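As a rough illustration (not the actual diff in this PR), the new flag would sit next to the existing backoff duration flags roughly as sketched below; the defaults and help text shown are assumptions for readability.

```go
package main

import (
	"flag"
	"time"
)

var (
	// Existing backoff tuning flags (defaults shown here are illustrative).
	initialNodeGroupBackoffDuration = flag.Duration("initial-node-group-backoff-duration", 5*time.Minute,
		"Duration of the first scale-up backoff after new nodes in a node group fail to start.")
	maxNodeGroupBackoffDuration = flag.Duration("max-node-group-backoff-duration", 30*time.Minute,
		"Maximum scale-up backoff duration for a node group after repeated failures.")

	// Proposed flag: when true, an OutOfResource backoff is kept until it expires
	// instead of being cleared as soon as the scale-up request finishes.
	nodeGroupKeepBackoffOutOfResources = flag.Bool("node-group-keep-backoff-out-of-resources", false,
		"Keep the full backoff period for a node group whose scale-up failed with an out-of-resources error.")
)

func main() {
	flag.Parse()
}
```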

Example

priority-expander config:

priorities:
  50:
    - .*compute-1.*
  40:
    - .*compute-2.*
  30:
    - .*compute-3.*
  20:
    - .*compute-4.*
  10:
    - .*compute-5.*

CA attempted ASG scale-ups in this order: compute-1 -> compute-2 -> compute-3 -> compute-1 -> compute-2. ASGs compute-4 and compute-5 were never attempted before CA cycled back through the out-of-capacity ASGs. The second attempts on compute-1 and compute-2 were made prematurely because their backoffs were removed when the scale-up requests finished rather than when the backoffs expired. With the proposed feature, CA would respect the full backoff duration for compute-1, compute-2, and compute-3, potentially allowing scale-up attempts on compute-4 and compute-5.

I0511 21:28:03.431934       1 priority.go:163] priority expander: compute-1-AutoScalingGroup chosen as the highest available
W0511 21:29:05.806767       1 auto_scaling_groups.go:453] Instance group compute-1-AutoScalingGroup cannot provision any more nodes!
I0511 21:29:05.854410       1 clusterstate.go:1084] Failed adding 68 nodes (68 unseen previously) to group compute-1-AutoScalingGroup due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
W0511 21:29:05.854451       1 clusterstate.go:294] Disabling scale-up for node group compute-1-AutoScalingGroup until 2023-05-11 21:34:04.970892339 +0000 UTC m=+448.035594703; errorClass=OutOfResource; errorCode=placeholder-cannot-be-fulfilled


W0511 21:29:30.133720       1 orchestrator.go:511] Node group compute-1-AutoScalingGroup is not ready for scaleup - backoff
I0511 21:29:30.875841       1 priority.go:163] priority expander: compute-2-AutoScalingGroup chosen as the highest available
W0511 21:30:33.170140       1 auto_scaling_groups.go:453] Instance group compute-2-AutoScalingGroup cannot provision any more nodes!
I0511 21:30:33.215512       1 clusterstate.go:1084] Failed adding 60 nodes (60 unseen previously) to group compute-2-AutoScalingGroup due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
W0511 21:30:33.215547       1 clusterstate.go:294] Disabling scale-up for node group compute-2-AutoScalingGroup until 2023-05-11 21:35:32.568995506 +0000 UTC m=+535.633697859; errorClass=OutOfResource; errorCode=placeholder-cannot-be-fulfilled


W0511 21:30:54.139919       1 orchestrator.go:511] Node group compute-2-AutoScalingGroup is not ready for scaleup - backoff
W0511 21:30:54.469904       1 orchestrator.go:511] Node group compute-1-AutoScalingGroup is not ready for scaleup - backoff
I0511 21:30:55.164562       1 priority.go:163] priority expander: compute-3-AutoScalingGroup chosen as the highest available
# Removing compute-1-AutoScalingGroup backoff before expiration at 21:34:04.970892339
I0511 21:31:36.935048       1 clusterstate.go:265] Scale up in group compute-1-AutoScalingGroup finished successfully in 3m23.087469244s
W0511 21:31:58.677023       1 auto_scaling_groups.go:453] Instance group compute-3-AutoScalingGroup cannot provision any more nodes!
I0511 21:31:58.724134       1 clusterstate.go:1084] Failed adding 118 nodes (118 unseen previously) to group compute-3-AutoScalingGroup due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
W0511 21:31:58.724163       1 clusterstate.go:294] Disabling scale-up for node group compute-3-AutoScalingGroup until 2023-05-11 21:36:57.970873708 +0000 UTC m=+621.035576071; errorClass=OutOfResource; errorCode=placeholder-cannot-be-fulfilled

W0511 21:32:29.184151       1 orchestrator.go:511] Node group compute-3-AutoScalingGroup is not ready for scaleup - backoff
W0511 21:32:29.635896       1 orchestrator.go:511] Node group compute-2-AutoScalingGroup is not ready for scaleup - backoff
I0511 21:32:30.634765       1 priority.go:163] priority expander: compute-1-AutoScalingGroup chosen as the highest available
# Removing compute-2-AutoScalingGroup backoff before expiration at 21:35:32.568995506
I0511 21:33:23.049938       1 clusterstate.go:265] Scale up in group compute-2-AutoScalingGroup finished successfully in 3m51.568357773s
W0511 21:33:33.928385       1 auto_scaling_groups.go:453] Instance group compute-1-AutoScalingGroup cannot provision any more nodes!
I0511 21:33:33.979834       1 clusterstate.go:1084] Failed adding 40 nodes (40 unseen previously) to group compute-1-AutoScalingGroup due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
W0511 21:33:33.979877       1 clusterstate.go:294] Disabling scale-up for node group compute-1-AutoScalingGroup until 2023-05-11 21:38:33.206839958 +0000 UTC m=+716.271542331; errorClass=OutOfResource; errorCode=placeholder-cannot-be-fulfilled


W0511 21:33:50.765746       1 orchestrator.go:511] Node group compute-3-AutoScalingGroup is not ready for scaleup - backoff
W0511 21:33:51.696754       1 orchestrator.go:511] Node group compute-1-AutoScalingGroup is not ready for scaleup - backoff
I0511 21:33:52.256395       1 priority.go:163] priority expander: compute-2-AutoScalingGroup chosen as the highest available
# Removing compute-3-AutoScalingGroup backoff before expiration at 21:36:57.970873708
I0511 21:34:02.528498       1 clusterstate.go:265] Scale up in group compute-3-AutoScalingGroup finished successfully in 2m56.526591463s
W0511 21:34:54.470085       1 auto_scaling_groups.go:453] Instance group compute-2-AutoScalingGroup cannot provision any more nodes!
I0511 21:34:54.520000       1 clusterstate.go:1084] Failed adding 47 nodes (47 unseen previously) to group compute-2-AutoScalingGroup due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
W0511 21:34:54.520033       1 clusterstate.go:294] Disabling scale-up for node group compute-2-AutoScalingGroup until 2023-05-11 21:39:53.877496735 +0000 UTC m=+796.942199088; errorClass=OutOfResource; errorCode=placeholder-cannot-be-fulfilled

Which issue(s) this PR fixes:

Fixes #2730

Special notes for your reviewer:

  • All existing unit tests pass
  • Tested on AWS EKS cluster

Does this PR introduce a user-facing change?

Add flag `node-group-keep-backoff-out-of-resources` which instructs CA to respect the entire duration of the backoff period for a node group that has hit an out-of-resources error during scale up.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


wllbo avatar May 12 '23 05:05 wllbo

CLA Signed

The committers listed above are authorized under a signed CLA.

  • :white_check_mark: login: wllbo / name: Will Bowers (aa1af0386254f91b4755f46e3f3d6a2edaded11c, 8a2cae379578b235a239e5ff03eb312a3e7b8547, 00fd3a802ccb4a49a9a9e4ce07e1e64af7d51789, 8e867f66c5ccedaee35142112c7de36a237c6878, 4477707256f3fb011e39887f9d302f2c1f1733b4)

Welcome @wllbo!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. :smiley:

k8s-ci-robot avatar May 12 '23 05:05 k8s-ci-robot

Hi @wllbo, please sign the CLA before the PR can be reviewed. See the following document to sign the CLA: Signing Contributor License Agreements (CLA)

Shubham82 avatar May 12 '23 11:05 Shubham82

@wllbo Excellent job on this! I'd really like to see this merged into main so I can use the feature.

gyanesh-mishra avatar Aug 09 '23 18:08 gyanesh-mishra

/assign @towca

towca avatar Sep 01 '23 16:09 towca

I fully get the motivation for this change, but IMO the implementation goes in the wrong direction in terms of readability. Unfortunately, CSR is both very convoluted and critical to CA's core functionality, so we should be extra careful not to make it less readable than it already is.

The current behavior is IMO surprising when reading the code (I also don't fully get why we need it in the first place, but that's another story). The initial impression when reading the code is that if any node for a given scale-up fails to come up, the node group is backed off. You have to parse a lot of logic to get to the "oh, so a partially failed scale-up actually resets the backoff very quickly" realization. This change would add another mental step on top of that: "oh, unless it's a resource-related error, in which case the backoff is kept".

WDYT about something like below instead? I think it'd increase the readability slightly, while still achieving the same result.

  1. Attach the encountered errors to ScaleUpRequest instead of the backoff. This has the added benefit of not changing the Backoff interface just for this one purpose.
  2. In the part after the scale-up finishes, rename things so that it's not "keeping backoff sometimes", but rather "removing the backoff early sometimes, based on the encountered errors (if there haven't been any likely-to-be-persistent (e.g. OutOfResources) errors encountered)".
  3. I'd also try to explain what's happening in that part in the first place - e.g. a comment like "A node group can be backed off in the middle of an ongoing scale-up because of instance creation errors. If the errors are not likely to be persistent, and the last instance we see comes up successfully, the backoff is removed quickly. This allows CA to retry scaling up the failed instances in the same node group. If we have pods targeting only a particular node group, and the instance creation errors are temporary, this minimizes the time the pods stay pending - at the cost of potentially making one more scale-up that fails.".

Given CSR's nature explained above, this change should also be well-covered with unit tests.
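In code, the suggested shape might look roughly like the self-contained sketch below: the scale-up request records the error classes it saw, and the backoff is only removed early when none of them look persistent. The type and method names are simplified stand-ins, not the actual ClusterStateRegistry or Backoff code.

```go
package clusterstate

import "time"

type errorClass int

const (
	errorOther errorClass = iota
	errorOutOfResources
)

// scaleUpRequest tracks an in-flight scale-up and the instance creation
// errors encountered while it was running.
type scaleUpRequest struct {
	nodeGroup         string
	start             time.Time
	encounteredErrors []errorClass
}

func (r *scaleUpRequest) recordError(c errorClass) {
	r.encounteredErrors = append(r.encounteredErrors, c)
}

// shouldRemoveBackoffEarly reports whether it is safe to clear the node group's
// backoff as soon as the scale-up finishes: only if none of the errors seen are
// likely to be persistent (such as OutOfResources). Otherwise the backoff is left
// to expire naturally, letting the expander fall back to other node groups.
func (r *scaleUpRequest) shouldRemoveBackoffEarly() bool {
	for _, c := range r.encounteredErrors {
		if c == errorOutOfResources {
			return false
		}
	}
	return true
}
```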

towca avatar Sep 04 '23 15:09 towca

@towca thank you for the detailed feedback on the PR. Based on your suggestions, I've made the following changes:

  • attached the encountered errors to ScaleUpRequest instead of modifying the backoff itself.
  • refactored logic to emphasize the early removal of a backoff based on the encountered errors, instead of sometimes "keeping" it.
  • added a comment explaining the rationale for the behavior, based on the example you provided. This should provide clearer context to future readers of the code.
  • added additional unit tests to cover the new behavior and changes introduced.

Please let me know if there are any further changes or clarifications I should make.

wllbo avatar Oct 18 '23 19:10 wllbo

@wllbo Please rebase the PR so that the merge conflicts are resolved.

Shubham82 avatar Nov 24 '23 07:11 Shubham82

@wllbo Can you please rebase the PR to resolve the merge conflicts? Thanks in advance.

kmsarabu avatar Feb 05 '24 15:02 kmsarabu

Hey @wllbo, sorry for the delay, this PR fell off my radar completely :(

I've discussed this with the author of the original code, and it seems that the current behavior (clearing the backoff whenever a scale-up ends) doesn't make sense anymore. Back in the early days of CA, when that code was written, scale-ups were done one node at a time and cloud providers didn't offer many different node configurations. Clearing the backoff early was the safe choice: there was likely no other node group to fall back to anyway, so the only cost was potentially one more scale-up that would fail.

Now many more node configurations are available and CA has to fall back between them, so the cost of clearing the backoff early is a less effective fallback. There can also be multiple nodes in a single scale-up now, and we'll almost always see the errors from the VMs that failed before we see the other VMs come up. So we'll just always cut the backoff short in the event of a partial scale-up failure, which wasn't the original intention at all.

All this to say: neither I nor the original author sees any clear advantage to resetting the backoff early anymore. Could you just remove that part from updateScaleRequests instead of making it configurable? This also has the benefit of not adding another flag/config option, of which CA already has far too many. Sorry for wasting your time with the previous suggestion.

towca avatar Feb 09 '24 16:02 towca

Hey @towca, thank you for getting back to me on this, and no worries about the delay. I completely agree with the benefits of only removing the one line instead. I was hesitant to remove it outright due to its longstanding presence in the code, but your discussion with the original author and the explanation you provided have addressed that concern - thanks for doing that. I've reverted my previous changes, removed the RemoveBackoff line from updateScaleRequests, and updated the test case to reflect the change in logic.
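For context, here is a simplified, self-contained sketch of what the successful-scale-up branch looks like after this change; the type and field names are stand-ins for the real ClusterStateRegistry internals, not the actual upstream code.

```go
package clusterstate

import (
	"fmt"
	"time"
)

type registry struct {
	scaleUpRequests map[string]time.Time // node group -> scale-up start time
	upcomingNodes   map[string]int       // node group -> nodes still expected
	backoffUntil    map[string]time.Time // node group -> backoff expiration
}

// updateScaleRequests drops finished scale-up requests. It previously also cleared
// the node group's backoff at this point; with this PR the backoff simply expires on
// its own schedule, so an OutOfResource backoff keeps the group out of rotation and
// the expander can fall back to other node groups in the meantime.
func (r *registry) updateScaleRequests(now time.Time) {
	for nodeGroup, started := range r.scaleUpRequests {
		if r.upcomingNodes[nodeGroup] == 0 {
			// Scale-up finished: all requested nodes have registered.
			delete(r.scaleUpRequests, nodeGroup)
			// (removed by this PR) delete(r.backoffUntil, nodeGroup)
			fmt.Printf("Scale up in group %s finished successfully in %v\n", nodeGroup, now.Sub(started))
		}
		// ... timeout handling for scale-ups that did not finish in time ...
	}
}
```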

wllbo avatar Feb 13 '24 15:02 wllbo

Thank you!
/lgtm
/approve

towca avatar Feb 13 '24 18:02 towca

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: towca, wllbo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Feb 13 '24 18:02 k8s-ci-robot