cluster-api-provider-aws
Allow a user to enable EKS MachinePool "auto-healing" upon Node Group scaling failure
/kind feature
Describe the solution you'd like:
A user should have the option to enable "auto-healing" upon Node Group scaling failures (i.e., when a Node Group scaling operation fails, revert to the previously successful state). Currently, CAPA continues to attempt to reconcile the Node Group into the desired (in this case, failing) state.
When a scaling operation is triggered via the AWS UI and the scaling operation fails, AWS sees the failure in the Auto Scaling Group that backs the Node Group and reverts the Node Group configuration to the previously successful state. A user should be able to opt into this flow in CAPA, since it matches what AWS itself provides and, in my mind, is good UX.
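To make the ask concrete, here is a rough Go sketch of what an opt-in knob could look like; the RollbackOnScalingFailure field, its placement, and the package name are my assumptions and are not part of the current CAPA API.

```go
// Illustrative sketch only (package name, field name, and semantics are
// assumptions); nothing here exists in CAPA today.
package sketch

// ManagedMachinePoolScaling is loosely modeled on CAPA's min/max scaling
// block for AWSManagedMachinePool.
type ManagedMachinePoolScaling struct {
	MinSize *int32 `json:"minSize,omitempty"`
	MaxSize *int32 `json:"maxSize,omitempty"`

	// RollbackOnScalingFailure (hypothetical) would ask the controller to stop
	// re-applying a desired size once the backing Auto Scaling Group reports
	// the scaling activity as failed, and instead revert the Node Group to the
	// last successfully applied size, mirroring the AWS console behavior.
	RollbackOnScalingFailure *bool `json:"rollbackOnScalingFailure,omitempty"`
}
```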
Anything else you would like to add:
One way to run into this issue is as follows:
- Have a Service Quota limit set for a specific instance family that restricts the vCPUs that can be used (in my case, a Service Quota limit of 64 vCPUs for G instance types, with g5.8xlarge instances in my Node Group).
- Attempt to scale the Node Group to a number that exceeds the Service Quota limit (i.e., 1 → 3 nodes; at 32 vCPUs per g5.8xlarge, 3 nodes need 96 vCPUs, which exceeds the 64 vCPU quota).
- CAPA will continue to attempt to reconcile to 3 nodes and will keep failing until the Service Quota limit is raised.
However, if I perform the same scaling operation via the AWS UI, once the first failure appears in the Auto Scaling Group's Activity log, AWS reverts the Node Group's configuration to the last successful state.
The same error flow can be run into when AWS runs out of compute capacity in your given Availability Zone.
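As an aside, the failure signal itself is easy to read programmatically. Below is a small standalone sketch (not CAPA code) that uses the AWS SDK for Go v2 to check whether the most recent scaling activity on the backing Auto Scaling Group failed; the ASG name is a placeholder.

```go
// Standalone sketch (not CAPA code): read the Auto Scaling Group activity log
// and report whether the most recent scaling activity failed, which is the
// failure signal described above (e.g. vCPU Service Quota exceeded, capacity shortage).
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
	astypes "github.com/aws/aws-sdk-go-v2/service/autoscaling/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := autoscaling.NewFromConfig(cfg)

	// Placeholder name: the ASG that EKS creates for the managed Node Group.
	asgName := "eks-my-nodegroup-asg"

	out, err := client.DescribeScalingActivities(ctx, &autoscaling.DescribeScalingActivitiesInput{
		AutoScalingGroupName: aws.String(asgName),
		MaxRecords:           aws.Int32(1), // limit to the most recent activity
	})
	if err != nil {
		log.Fatal(err)
	}
	if len(out.Activities) == 0 {
		fmt.Println("no scaling activities recorded")
		return
	}

	latest := out.Activities[0]
	if latest.StatusCode == astypes.ScalingActivityStatusCodeFailed {
		// This is the state the AWS console flow reacts to by reverting the config.
		fmt.Printf("latest scaling activity failed: %s\n", aws.ToString(latest.StatusMessage))
	} else {
		fmt.Printf("latest scaling activity status: %s\n", latest.StatusCode)
	}
}
```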
Screenshot showing the Auto Scaling Group activity log when a failing scaling operation is triggered via the UI, and the configuration revert that follows. I see the same flow (revert to the previous successful state) when the Availability Zone does not have enough compute capacity to launch the required instances.
Screenshot showing the Auto Scaling Group activity log when the failing scaling operation is triggered via CAPA. Note that it repeatedly fails as CAPA keeps reconciling toward the desired (failing) state.
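Purely as a sketch of what "revert to the last successful state" could mean for a managed Node Group: the snippet below uses real EKS API calls (DescribeNodegroup, UpdateNodegroupConfig), but the lastKnownGoodSize bookkeeping and where such logic would hook into CAPA's reconciliation are assumptions on my part, not existing behavior.

```go
// Sketch only: uses real EKS API calls, but the lastKnownGoodSize bookkeeping
// and where this would run inside a controller are assumptions.
package nodegrouprollback

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/eks"
	ekstypes "github.com/aws/aws-sdk-go-v2/service/eks/types"
)

// RevertOnLaunchFailure reverts the Node Group's desired size to
// lastKnownGoodSize if EKS reports instance launch failures (the health issue
// raised when, for example, a vCPU Service Quota is exceeded).
func RevertOnLaunchFailure(ctx context.Context, client *eks.Client, cluster, nodegroup string, lastKnownGoodSize int32) error {
	out, err := client.DescribeNodegroup(ctx, &eks.DescribeNodegroupInput{
		ClusterName:   aws.String(cluster),
		NodegroupName: aws.String(nodegroup),
	})
	if err != nil {
		return err
	}

	launchFailed := false
	if out.Nodegroup.Health != nil {
		for _, issue := range out.Nodegroup.Health.Issues {
			if issue.Code == ekstypes.NodegroupIssueCodeAsgInstanceLaunchFailures {
				launchFailed = true
				break
			}
		}
	}
	if !launchFailed {
		return nil // nothing to do
	}

	// Revert only the desired size; min/max are left as configured.
	_, err = client.UpdateNodegroupConfig(ctx, &eks.UpdateNodegroupConfigInput{
		ClusterName:   aws.String(cluster),
		NodegroupName: aws.String(nodegroup),
		ScalingConfig: &ekstypes.NodegroupScalingConfig{
			DesiredSize: aws.Int32(lastKnownGoodSize),
		},
	})
	if err != nil {
		return fmt.Errorf("reverting node group %s to size %d: %w", nodegroup, lastKnownGoodSize, err)
	}
	return nil
}
```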
Environment:
- Cluster-api-provider-aws version: v2.3.1
- Kubernetes version (use kubectl version): v1.28
- OS (e.g. from /etc/os-release):
This issue is currently awaiting triage.
If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and will provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.