cluster-api-provider-aws Graceful shutdown of machinepool nodes

trafficstars

What type of PR is this? /kind feature

What this PR does / why we need it: Ensures graceful shutdown of nodes in awsmachinepools (AWS ASG). watch lifecycle transition of nodes in awsmachinepools as soon as they enter the termination wait condition, cordon it and evict the pods post eviction of pods complete the lifecycle hook that prevents immediate deletion of node on scale in event

Which issue(s) this PR fixes Fixes #2574

Release note:

None.

Feb 25 '22 16:02 akash-gautam

@akash-gautam: This issue is currently awaiting triage.

If CAPA/CAPI contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Feb 25 '22 16:02 k8s-ci-robot

Hi @akash-gautam. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Feb 25 '22 16:02 k8s-ci-robot

Hi @akash-gautam, thanks for the PR.

Few considerations here: 1- What would be the benefit of using MachinePool Machines for graceful shutdowns like CAPZ? 2- Can this implementation be extended to be used when spot instances are supported in MachinePools? 3- We already have an eventbridge implementation, but that is in experimental stage. We can graduate it to reuse here.

I think a proposal or at least an ADR would be good for this issue considering there could be multiple ways to achieve this.

cc @richardcase

Mar 02 '22 08:03 sedefsavas

@sedefsavas 1- What would be the benefit of using MachinePool Machines for graceful shutdowns like CAPZ? - sorry I didn't get this question can you please explain. 2- Can this implementation be extended to be used when spot instances are supported in MachinePools? - We can but it might be a bit tricky as the spot instances won't follow the life cycle states that on-demand instances follow in an ASG, we are creating a pre-deletion hook on the machine pool asg which will prevent the instance from getting terminated for 15 mins in case of a scale in event but for spot instance we will get only a 2 min notice (even less if the interuption action is hibernate) . 3- We already have an eventbridge implementation, but that is in experimental stage. We can graduate it to reuse here. - Our current implementaion is for lifecycle events of ec2 instance when it is managed by ec2 but for this we needs lifecycle events of ec2 instances which is managed by ASG. The lifecycle of ec2 instances differs based on whether they are managed by ec2 or asg. I also first tried using the instancestate package that we already had but for it the source of events is aws.ec2 which doesn't capture the lifecycle states like terminating:wait that is why I had to create a new package which is quite similar to it but caputues the events from aws.autoscaling.

Mar 07 '22 17:03 akash-gautam

this is very nice feature btw 💯

Mar 20 '22 19:03 dntosas

@sedefsavas any news on here?

Apr 05 '22 12:04 dntosas

How about graduating eventbridge experimental feature and repurpose it to get event notifications for ASG scale in as well? This will be also useful when we want to implement spot instance draining utilizing Spot Instance Termination Notice events.

@Ankitasw do you want to start a document or ADR for redesigning eventbridge flexible enough to reuse it for instance status updates + ASG scale in + spot instance termination notices? We can collaborate there and once that work is done, draining logic will just do the draining every time a relevant event is observed.

In the current eventbridge implementation, when an event is received, relevant machine reconciliation is triggered. This was enough for this use case, because machine controller can see the instance status in the reconciliation so no need to know the nature of the event. But for scale in events, probably there is no instance status change, hence events need to be processed at the controller itself.

Apr 12 '22 01:04 sedefsavas

@sedefsavas yes I am already working on a plan on how to graduate the EventsBridge to cater status update of instances, ASG scale-in and spot instance termination notice, so as to apply those to draining logic in MachinePools and Spot Instances.

Apr 12 '22 06:04 Ankitasw

How about graduating eventbridge experimental feature and repurpose it to get event notifications for ASG scale in as well? This will be also useful when we want to implement spot instance draining utilizing Spot Instance Termination Notice events.

Yes i think thats a great idea @sedefsavas :+1:

Apr 12 '22 08:04 richardcase

@Ankitasw If there are any tasks/issues in the graduating the EventsBridge plan that I can take up, please let me know.

Apr 12 '22 11:04 akash-gautam

Created an issue for the eventbridge graduation https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/3414

Apr 13 '22 03:04 sedefsavas

/ok-to-test

Jun 17 '22 07:06 richardcase

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign cecilerobertmichon for approval by writing /assign @cecilerobertmichon in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

Aug 30 '22 19:08 k8s-ci-robot

@akash-gautam: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-cluster-api-provider-aws-build-release-1-5	ab9ae2a11e633550e6f668f5dbecaf81ca1ab106	link	true	`/test pull-cluster-api-provider-aws-build-release-1-5`
pull-cluster-api-provider-aws-test	2e63b89874e08445f722e419eaa7d11836ea3e72	link	true	`/test pull-cluster-api-provider-aws-test`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Oct 13 '22 14:10 k8s-ci-robot

@akash-gautam apologies for replying late, but we have a planned a phased approach now to enable this first in EventBridge and then a separate proposal for node draining. This is the ADR for graduating EventBridge to accommodate different types of events. I would let you know once we start that feature, and see if you could make contributions on the same.

As of now closing this PR to minimize duplicate efforts, we could always revisit PR if we want to reuse some bits. /close

Oct 14 '22 06:10 Ankitasw

@Ankitasw: Closed this PR.

In response to this:

@akash-gautam apologies for replying late, but we have a planned a phased approach now to enable this first in EventBridge and then a separate proposal for node draining. This is the ADR for graduating EventBridge to accommodate different types of events. I would let you know once we start that feature, and see if you could make contributions on the same.

As of now closing this PR to minimize duplicate efforts, we could always revisit PR if we want to reuse some bits. /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Oct 14 '22 06:10 k8s-ci-robot

@Ankitasw Okay, thanks for the info.

Oct 14 '22 08:10 akash-gautam

cluster-api-provider-aws cluster-api-provider-aws copied to clipboard

Graceful shutdown of machinepool nodes

cluster-api-provider-aws
cluster-api-provider-aws copied to clipboard