cluster-api-provider-aws
Allow a user to enable EKS MachinePool "auto-healing" upon Node Group scaling failure
/kind feature
Describe the solution you'd like:
A user should have the option to enable "auto-healing" upon Node Group scaling failures (i.e., when a Node Group scaling operation fails, revert to the previously successful state). Currently, CAPA continues to attempt to reconcile the Node Group into the desired (in this case, failing) state.
When a scaling operation is triggered via the AWS UI and the scaling operation fails, AWS sees the failure in the Auto Scaling Group that backs the Node Group and reverts the Node Group configuration to the previously successful state. A user should be able to opt into this flow in CAPA, since it matches what AWS itself provides and, in my mind, is good UX.
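To make the ask concrete, here is a rough Go sketch of what an opt-in knob could look like; the RollbackOnScalingFailure field, its placement, and the package name are my assumptions and are not part of the current CAPA API.

```go
// Illustrative sketch only (package name, field name, and semantics are
// assumptions); nothing here exists in CAPA today.
package sketch

// ManagedMachinePoolScaling is loosely modeled on CAPA's min/max scaling
// block for AWSManagedMachinePool.
type ManagedMachinePoolScaling struct {
	MinSize *int32 `json:"minSize,omitempty"`
	MaxSize *int32 `json:"maxSize,omitempty"`

	// RollbackOnScalingFailure (hypothetical) would ask the controller to stop
	// re-applying a desired size once the backing Auto Scaling Group reports
	// the scaling activity as failed, and instead revert the Node Group to the
	// last successfully applied size, mirroring the AWS console behavior.
	RollbackOnScalingFailure *bool `json:"rollbackOnScalingFailure,omitempty"`
}
```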
Anything else you would like to add:
One way to run into this issue is as follows:
- Have a Service Quota limit set for a specific instance family that restricts the vCPUs that can be used (in my case, a Service Quota limit of 64 vCPUs for G instance types, with g5.8xlarge instances in my Node Group).
- Attempt to scale the Node Group to a number that exceeds the Service Quota limit (i.e., 1 → 3 nodes; at 32 vCPUs per g5.8xlarge, 3 nodes need 96 vCPUs, which exceeds the 64 vCPU quota).
- CAPA will continue to attempt to reconcile to 3 nodes and will keep failing until the Service Quota limit is raised.
However, if I perform the same scaling operation via the AWS UI, once the first failure appears in the Auto Scaling Group's Activity log, AWS reverts the Node Group's configuration to the last successful state.
The same error flow can be run into when AWS runs out of compute capacity in your given Availability Zone.
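As an aside, the failure signal itself is easy to read programmatically. Below is a small standalone sketch (not CAPA code) that uses the AWS SDK for Go v2 to check whether the most recent scaling activity on the backing Auto Scaling Group failed; the ASG name is a placeholder.

```go
// Standalone sketch (not CAPA code): read the Auto Scaling Group activity log
// and report whether the most recent scaling activity failed, which is the
// failure signal described above (e.g. vCPU Service Quota exceeded, capacity shortage).
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
	astypes "github.com/aws/aws-sdk-go-v2/service/autoscaling/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := autoscaling.NewFromConfig(cfg)

	// Placeholder name: the ASG that EKS creates for the managed Node Group.
	asgName := "eks-my-nodegroup-asg"

	out, err := client.DescribeScalingActivities(ctx, &autoscaling.DescribeScalingActivitiesInput{
		AutoScalingGroupName: aws.String(asgName),
		MaxRecords:           aws.Int32(1), // limit to the most recent activity
	})
	if err != nil {
		log.Fatal(err)
	}
	if len(out.Activities) == 0 {
		fmt.Println("no scaling activities recorded")
		return
	}

	latest := out.Activities[0]
	if latest.StatusCode == astypes.ScalingActivityStatusCodeFailed {
		// This is the state the AWS console flow reacts to by reverting the config.
		fmt.Printf("latest scaling activity failed: %s\n", aws.ToString(latest.StatusMessage))
	} else {
		fmt.Printf("latest scaling activity status: %s\n", latest.StatusCode)
	}
}
```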
Screenshot showing the Auto Scaling Group activity log when a failing scaling operation is triggered via the UI, and the configuration revert that follows. I see the same flow (revert to the previous successful state) when the Availability Zone does not have enough compute capacity to launch the required instances.
Screenshot showing the Auto Scaling Group activity log when the failing scaling operation is triggered via CAPA. Note that it repeatedly fails as CAPA keeps reconciling toward the desired (failing) state.
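Purely as a sketch of what "revert to the last successful state" could mean for a managed Node Group: the snippet below uses real EKS API calls (DescribeNodegroup, UpdateNodegroupConfig), but the lastKnownGoodSize bookkeeping and where such logic would hook into CAPA's reconciliation are assumptions on my part, not existing behavior.

```go
// Sketch only: uses real EKS API calls, but the lastKnownGoodSize bookkeeping
// and where this would run inside a controller are assumptions.
package nodegrouprollback

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/eks"
	ekstypes "github.com/aws/aws-sdk-go-v2/service/eks/types"
)

// RevertOnLaunchFailure reverts the Node Group's desired size to
// lastKnownGoodSize if EKS reports instance launch failures (the health issue
// raised when, for example, a vCPU Service Quota is exceeded).
func RevertOnLaunchFailure(ctx context.Context, client *eks.Client, cluster, nodegroup string, lastKnownGoodSize int32) error {
	out, err := client.DescribeNodegroup(ctx, &eks.DescribeNodegroupInput{
		ClusterName:   aws.String(cluster),
		NodegroupName: aws.String(nodegroup),
	})
	if err != nil {
		return err
	}

	launchFailed := false
	if out.Nodegroup.Health != nil {
		for _, issue := range out.Nodegroup.Health.Issues {
			if issue.Code == ekstypes.NodegroupIssueCodeAsgInstanceLaunchFailures {
				launchFailed = true
				break
			}
		}
	}
	if !launchFailed {
		return nil // nothing to do
	}

	// Revert only the desired size; min/max are left as configured.
	_, err = client.UpdateNodegroupConfig(ctx, &eks.UpdateNodegroupConfigInput{
		ClusterName:   aws.String(cluster),
		NodegroupName: aws.String(nodegroup),
		ScalingConfig: &ekstypes.NodegroupScalingConfig{
			DesiredSize: aws.Int32(lastKnownGoodSize),
		},
	})
	if err != nil {
		return fmt.Errorf("reverting node group %s to size %d: %w", nodegroup, lastKnownGoodSize, err)
	}
	return nil
}
```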
Environment:
- Cluster-api-provider-aws version: v2.3.1
- Kubernetes version (use kubectl version): v1.28
- OS (e.g. from /etc/os-release):
This issue is currently awaiting triage.
If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and will provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.