cluster-api-provider-aws icon indicating copy to clipboard operation
cluster-api-provider-aws copied to clipboard

Graceful shutdown of machinepool nodes

Open akash-gautam opened this issue 3 years ago • 15 comments
trafficstars

What type of PR is this? /kind feature

What this PR does / why we need it: Ensures graceful shutdown of nodes in awsmachinepools (AWS ASG). watch lifecycle transition of nodes in awsmachinepools as soon as they enter the termination wait condition, cordon it and evict the pods post eviction of pods complete the lifecycle hook that prevents immediate deletion of node on scale in event

Which issue(s) this PR fixes Fixes #2574

Release note:

None.

akash-gautam avatar Feb 25 '22 16:02 akash-gautam

@akash-gautam: This issue is currently awaiting triage.

If CAPA/CAPI contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 25 '22 16:02 k8s-ci-robot

Hi @akash-gautam. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 25 '22 16:02 k8s-ci-robot

Hi @akash-gautam, thanks for the PR.

Few considerations here: 1- What would be the benefit of using MachinePool Machines for graceful shutdowns like CAPZ? 2- Can this implementation be extended to be used when spot instances are supported in MachinePools? 3- We already have an eventbridge implementation, but that is in experimental stage. We can graduate it to reuse here.

I think a proposal or at least an ADR would be good for this issue considering there could be multiple ways to achieve this.

cc @richardcase

sedefsavas avatar Mar 02 '22 08:03 sedefsavas

@sedefsavas 1- What would be the benefit of using MachinePool Machines for graceful shutdowns like CAPZ? - sorry I didn't get this question can you please explain. 2- Can this implementation be extended to be used when spot instances are supported in MachinePools? - We can but it might be a bit tricky as the spot instances won't follow the life cycle states that on-demand instances follow in an ASG, we are creating a pre-deletion hook on the machine pool asg which will prevent the instance from getting terminated for 15 mins in case of a scale in event but for spot instance we will get only a 2 min notice (even less if the interuption action is hibernate) . 3- We already have an eventbridge implementation, but that is in experimental stage. We can graduate it to reuse here. - Our current implementaion is for lifecycle events of ec2 instance when it is managed by ec2 but for this we needs lifecycle events of ec2 instances which is managed by ASG. The lifecycle of ec2 instances differs based on whether they are managed by ec2 or asg. I also first tried using the instancestate package that we already had but for it the source of events is aws.ec2 which doesn't capture the lifecycle states like terminating:wait that is why I had to create a new package which is quite similar to it but caputues the events from aws.autoscaling.

akash-gautam avatar Mar 07 '22 17:03 akash-gautam

this is very nice feature btw 💯

dntosas avatar Mar 20 '22 19:03 dntosas

@sedefsavas any news on here?

dntosas avatar Apr 05 '22 12:04 dntosas

How about graduating eventbridge experimental feature and repurpose it to get event notifications for ASG scale in as well? This will be also useful when we want to implement spot instance draining utilizing Spot Instance Termination Notice events.

@Ankitasw do you want to start a document or ADR for redesigning eventbridge flexible enough to reuse it for instance status updates + ASG scale in + spot instance termination notices? We can collaborate there and once that work is done, draining logic will just do the draining every time a relevant event is observed.

In the current eventbridge implementation, when an event is received, relevant machine reconciliation is triggered. This was enough for this use case, because machine controller can see the instance status in the reconciliation so no need to know the nature of the event. But for scale in events, probably there is no instance status change, hence events need to be processed at the controller itself.

sedefsavas avatar Apr 12 '22 01:04 sedefsavas

@sedefsavas yes I am already working on a plan on how to graduate the EventsBridge to cater status update of instances, ASG scale-in and spot instance termination notice, so as to apply those to draining logic in MachinePools and Spot Instances.

Ankitasw avatar Apr 12 '22 06:04 Ankitasw

How about graduating eventbridge experimental feature and repurpose it to get event notifications for ASG scale in as well? This will be also useful when we want to implement spot instance draining utilizing Spot Instance Termination Notice events.

Yes i think thats a great idea @sedefsavas :+1:

richardcase avatar Apr 12 '22 08:04 richardcase

@Ankitasw If there are any tasks/issues in the graduating the EventsBridge plan that I can take up, please let me know.

akash-gautam avatar Apr 12 '22 11:04 akash-gautam

Created an issue for the eventbridge graduation https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/3414

sedefsavas avatar Apr 13 '22 03:04 sedefsavas

/ok-to-test

richardcase avatar Jun 17 '22 07:06 richardcase

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign cecilerobertmichon for approval by writing /assign @cecilerobertmichon in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Aug 30 '22 19:08 k8s-ci-robot

@akash-gautam: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-aws-build-release-1-5 ab9ae2a11e633550e6f668f5dbecaf81ca1ab106 link true /test pull-cluster-api-provider-aws-build-release-1-5
pull-cluster-api-provider-aws-test 2e63b89874e08445f722e419eaa7d11836ea3e72 link true /test pull-cluster-api-provider-aws-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot avatar Oct 13 '22 14:10 k8s-ci-robot

@akash-gautam apologies for replying late, but we have a planned a phased approach now to enable this first in EventBridge and then a separate proposal for node draining. This is the ADR for graduating EventBridge to accommodate different types of events. I would let you know once we start that feature, and see if you could make contributions on the same.

As of now closing this PR to minimize duplicate efforts, we could always revisit PR if we want to reuse some bits. /close

Ankitasw avatar Oct 14 '22 06:10 Ankitasw

@Ankitasw: Closed this PR.

In response to this:

@akash-gautam apologies for replying late, but we have a planned a phased approach now to enable this first in EventBridge and then a separate proposal for node draining. This is the ADR for graduating EventBridge to accommodate different types of events. I would let you know once we start that feature, and see if you could make contributions on the same.

As of now closing this PR to minimize duplicate efforts, we could always revisit PR if we want to reuse some bits. /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Oct 14 '22 06:10 k8s-ci-robot

@Ankitasw Okay, thanks for the info.

akash-gautam avatar Oct 14 '22 08:10 akash-gautam