cluster-api-provider-aws

AWSMachinePool does not drain nodes during scale-in

Open dthorsen opened this issue 5 years ago • 12 comments

/kind bug

What steps did you take and what happened:

  • Create a workload cluster with the experimental EKS Control Plane.
  • Create a MachinePool with replicas: 5 and create the associated AWSMachinePool resources. (Note: this AWSMachinePool is not managed by cluster-autoscaler)
  • Create a Deployment and scale it so that every machine runs at least one of its pods.
  • Create a PDB protecting the Deployment with maxUnavailable: 1.
  • Scale the MachinePool down to replicas: 3.

This caused the AWSMachinePool controller to set the desired capacity of the ASG to 3 without draining any nodes. The PDB was not honored, and the ASG terminated the EC2 instances immediately.

What did you expect to happen: The nodes should have drained gracefully before the EC2 instances are terminated.

Anything else you would like to add: In the current AWSMachinePool implementation, the AutoScalingGroup selects which instances to remove during scale-in. For the non-cluster-autoscaler case, this could be fixed by having the AWSMachinePool controller select the nodes to remove, drain them, and then call the AWS TerminateInstanceInAutoScalingGroup action with the request value ShouldDecrementDesiredCapacity: true.
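
A minimal Go sketch of that flow, to make the ordering concrete. This is not CAPA code: scaleIn, the drain and terminate callbacks, and the "take from the end of the list" selection policy are all hypothetical. The terminate callback only stands in for the AWS TerminateInstanceInAutoScalingGroup action called with ShouldDecrementDesiredCapacity: true.

```go
package main

import (
	"errors"
	"fmt"
)

// scaleIn drains and then terminates surplus instances one at a time until
// only `desired` remain. It stops at the first failure, so an instance is
// never terminated unless its drain succeeded.
//
// drain and terminate are hypothetical callbacks: drain would cordon the node
// and evict its pods (honoring PDBs); terminate stands in for the AWS
// TerminateInstanceInAutoScalingGroup action.
func scaleIn(
	instances []string,
	desired int,
	drain func(instanceID string) error,
	terminate func(instanceID string, shouldDecrementDesiredCapacity bool) error,
) ([]string, error) {
	if desired < 0 {
		return nil, errors.New("desired replicas must not be negative")
	}
	var removed []string
	for len(instances)-len(removed) > desired {
		// Assumed selection policy: pick from the end of the list.
		victim := instances[len(instances)-1-len(removed)]
		if err := drain(victim); err != nil {
			return removed, fmt.Errorf("drain %s: %w", victim, err)
		}
		// Decrementing the desired capacity keeps the ASG from launching a
		// replacement for the instance we just removed.
		if err := terminate(victim, true); err != nil {
			return removed, fmt.Errorf("terminate %s: %w", victim, err)
		}
		removed = append(removed, victim)
	}
	return removed, nil
}

func main() {
	instances := []string{"i-a", "i-b", "i-c", "i-d", "i-e"}
	drain := func(id string) error {
		fmt.Println("draining", id)
		return nil
	}
	terminate := func(id string, decrement bool) error {
		fmt.Println("terminating", id, "decrement:", decrement)
		return nil
	}
	removed, err := scaleIn(instances, 3, drain, terminate)
	fmt.Println("removed:", removed, "err:", err)
}
```

The point of the sketch is only the ordering: the controller, not the ASG, picks the instance, and termination is requested only after the drain returns without error.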

We may also want to consider a lifecycle hook on the Auto Scaling group that prevents EC2 instance termination until the drain completes. This would help prevent instances from being forcibly terminated without draining when the desired capacity is changed via the EC2 console, CLI, or APIs.
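
A rough Go sketch of what a handler for such a hook could look like. The TerminatingEvent type, handleTerminating, and both callbacks are hypothetical names for illustration; complete only stands in for the AWS CompleteLifecycleAction call. For a terminating instance, both CONTINUE and ABANDON let the termination proceed, so the sketch releases the hook either way and reports the drain error to the caller.

```go
package main

import (
	"errors"
	"fmt"
)

// TerminatingEvent is a hypothetical, trimmed-down view of the notification
// sent for the autoscaling:EC2_INSTANCE_TERMINATING lifecycle transition.
type TerminatingEvent struct {
	InstanceID string
	HookName   string
}

// handleTerminating drains the node behind the instance and then releases the
// lifecycle hook so the ASG can finish terminating it.
//
// drain and complete are hypothetical callbacks; complete stands in for the
// AWS CompleteLifecycleAction call with the given LifecycleActionResult.
func handleTerminating(
	ev TerminatingEvent,
	drain func(instanceID string) error,
	complete func(ev TerminatingEvent, result string) error,
) error {
	drainErr := drain(ev.InstanceID)

	// The instance terminates in both cases; ABANDON additionally skips any
	// remaining lifecycle hooks.
	result := "CONTINUE"
	if drainErr != nil {
		result = "ABANDON"
	}
	if err := complete(ev, result); err != nil {
		return fmt.Errorf("complete hook %s for %s: %w", ev.HookName, ev.InstanceID, err)
	}
	if drainErr != nil {
		return fmt.Errorf("drain %s: %w", ev.InstanceID, drainErr)
	}
	return nil
}

func main() {
	ev := TerminatingEvent{InstanceID: "i-a", HookName: "drain-before-terminate"}
	complete := func(ev TerminatingEvent, result string) error {
		fmt.Println("completing", ev.HookName, "with", result)
		return nil
	}

	okDrain := func(id string) error { fmt.Println("draining", id); return nil }
	fmt.Println("err:", handleTerminating(ev, okDrain, complete))

	badDrain := func(id string) error { return errors.New("eviction timed out") }
	fmt.Println("err:", handleTerminating(ev, badDrain, complete))
}
```

If nothing completes the hook, the instance stays in the wait state until the hook's heartbeat timeout expires, so the timeout effectively caps how long a drain may take.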

Environment:

  • Cluster-api-provider-aws version: Commit: 3338cd4
  • Kubernetes version: (use kubectl version): v.1.17.9
  • OS (e.g. from /etc/os-release): Amazon Linux 2

dthorsen avatar Oct 13 '20 18:10 dthorsen

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar Feb 07 '21 14:02 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten

fejta-bot avatar Mar 09 '21 15:03 fejta-bot

From chatting with @sedefsavas: AWS Node Termination Handler ( https://github.com/aws/aws-node-termination-handler ) can help, but it doesn't fully eliminate the problem, since it only gives a 2-minute warning.

Sync with CAPZ on MachinePool v.Next

@kschumy , any ideas on what we should do here?

randomvariable avatar Mar 11 '21 19:03 randomvariable

We could follow an approach similar to OpenShift's POC, which polls the termination endpoint: https://github.com/openshift/cluster-api-provider-aws/blob/b4a3478db44ddb554883cf77a9e5f49ffd54fdf4/pkg/termination/handler.go
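
To illustrate the polling idea (this is a sketch, not the OpenShift handler's actual code): the function below polls a URL at a fixed interval and returns once the endpoint answers 200 OK. Treating 200 as "termination scheduled" and anything else as "not yet" is an assumption of this sketch, as are the names pollTermination and demo. The demo runs against a local fake server rather than the real instance metadata service, which may additionally require a session token.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// pollTermination polls url every interval until it answers 200 OK, which
// this sketch treats as a termination notice. It returns the number of polls
// made, and ctx.Err() if the context ends before a notice is seen.
func pollTermination(ctx context.Context, client *http.Client, url string, interval time.Duration) (int, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	polls := 0
	for {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return polls, err
		}
		polls++
		// Transient request errors are ignored; we simply try again.
		if resp, err := client.Do(req); err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return polls, nil
			}
		}
		select {
		case <-ctx.Done():
			return polls, ctx.Err()
		case <-ticker.C:
		}
	}
}

// demo runs the poller against a local fake endpoint that answers 404 twice
// and then 200, so no network access or AWS account is needed.
func demo() (int, error) {
	var hits int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&hits, 1) < 3 {
			http.NotFound(w, r)
			return
		}
		fmt.Fprint(w, "termination scheduled") // fake body for the demo
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return pollTermination(ctx, srv.Client(), srv.URL, 10*time.Millisecond)
}

func main() {
	polls, err := demo()
	fmt.Println("polls:", polls, "err:", err)
}
```

Once the poller returns, a node-local agent would mark the node for draining. How much time that leaves depends on how early the endpoint reports the termination.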

More on this is discussed in the cluster-api proposal: https://github.com/kubernetes-sigs/cluster-api/pull/3528

sedefsavas avatar Mar 23 '21 03:03 sedefsavas

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community. /close

fejta-bot avatar Apr 22 '21 04:04 fejta-bot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community. /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 22 '21 04:04 k8s-ci-robot

/reopen /remove-lifecycle rotten

richardcase avatar Mar 22 '23 15:03 richardcase

From office hours 2023-04-03:

  • This will potentially be handled by #4184.
  • Provider's refresh will have a weakness, as AWS only gives a small amount of time before termination (same issue with AWSManagedMachinePools).
  • Users expect nodes to be drained.

/triage accepted /priority important-soon

richardcase avatar Apr 03 '23 16:04 richardcase

Also from office hours discussion:

Users define Pod Disruption Budgets to limit how many of their Pods can be unavailable at once due to voluntary disruptions.

A scale-in of a MachinePool, if it uses the "providers refresh", will always proceed, even if it violates a budget.

For comparison, a scale-in of a MachineDeployment will never proceed if it violates a budget.
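
A toy Go model of the difference, using the maxUnavailable: 1 scenario from the issue description. The budget type and its methods are hypothetical simplifications, not the Kubernetes API: tryEvict mimics an eviction that is refused when it would exceed the budget, and forceTerminate mimics an instance that disappears without a drain.

```go
package main

import "fmt"

// budget is a simplified stand-in for a PodDisruptionBudget that uses
// maxUnavailable. It only counts pods; it does not model real scheduling.
type budget struct {
	maxUnavailable int
	unavailable    int
}

// tryEvict mimics a drain-driven eviction: it is refused if it would take
// more pods down than the budget allows.
func (b *budget) tryEvict() bool {
	if b.unavailable >= b.maxUnavailable {
		return false
	}
	b.unavailable++
	return true
}

// forceTerminate mimics an instance being terminated without a drain: the
// pod goes down whether or not the budget allows it.
func (b *budget) forceTerminate() { b.unavailable++ }

// replacementReady records that a replacement pod became ready elsewhere.
func (b *budget) replacementReady() {
	if b.unavailable > 0 {
		b.unavailable--
	}
}

// violated reports whether more pods are down than the budget permits.
func (b *budget) violated() bool { return b.unavailable > b.maxUnavailable }

func main() {
	// Scale-in without draining: two instances vanish at once.
	undrained := &budget{maxUnavailable: 1}
	undrained.forceTerminate()
	undrained.forceTerminate()
	fmt.Println("undrained scale-in violates budget:", undrained.violated())

	// Scale-in with draining: the second eviction waits for a replacement.
	drained := &budget{maxUnavailable: 1}
	fmt.Println("first eviction allowed:", drained.tryEvict())
	fmt.Println("second eviction allowed:", drained.tryEvict())
	drained.replacementReady()
	fmt.Println("retry after replacement is ready:", drained.tryEvict())
	fmt.Println("drained scale-in violates budget:", drained.violated())
}
```

In this model the drained path never exceeds the budget because each eviction is gated, while the undrained path has no gate at all.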

dlipovetsky avatar Apr 03 '23 16:04 dlipovetsky

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged. Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot avatar Jul 02 '23 17:07 k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 23 '24 16:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 22 '24 16:02 k8s-triage-robot

/remove-lifecycle rotten

harveyxia avatar Dec 03 '24 14:12 harveyxia

Is there any momentum around getting this implemented? We make extensive use of AWSMachinePools and need the ability for the Nodes to be drained to avoid disrupting hosted workloads.

harveyxia avatar Dec 03 '24 14:12 harveyxia

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Mar 03 '25 15:03 k8s-triage-robot

/remove-lifecycle stale /priority important-soon

richardcase avatar Mar 04 '25 17:03 richardcase

/help

richardcase avatar Mar 04 '25 17:03 richardcase

@richardcase: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Mar 04 '25 17:03 k8s-ci-robot

Lifecycle hooks are now supported via AWSMachinePool.spec.lifecycleHooks and can be used to delay termination. Combine them with aws-node-termination-handler to get the expected draining, with configurable timeouts.

@richardcase I think we can close this, since draining is not a machine pool feature – or at least not expected in the near future. The above is an easy, proven solution that avoids CAPA adding this feature which other software can already do.

AndiDog avatar Oct 18 '25 15:10 AndiDog