cluster-api-provider-aws
Graceful shutdown of machinepool nodes
What type of PR is this? /kind feature
What this PR does / why we need it: Ensures graceful shutdown of nodes in AWSMachinePools (AWS ASGs). It watches the lifecycle transitions of nodes in AWSMachinePools; as soon as a node enters the termination wait state, it cordons the node and evicts its pods. Once pod eviction completes, it completes the lifecycle hook that prevents immediate deletion of the node on a scale-in event.
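The cordon → drain → complete-lifecycle-hook flow described above can be sketched as follows. This is an illustrative outline, not the actual CAPA controller code: the type and function names are hypothetical stand-ins, and the real implementation would operate on corev1.Node objects, use the Kubernetes eviction API, and then call the AWS Auto Scaling CompleteLifecycleAction API with the returned result ("CONTINUE" releases the instance for termination, "ABANDON" aborts the scale-in).

```go
package main

import (
	"errors"
	"fmt"
)

// node is a minimal stand-in for a cluster node; the real controller
// works with corev1.Node objects.
type node struct {
	name     string
	cordoned bool
}

// cordon marks the node unschedulable, mirroring `kubectl cordon`.
func cordon(n *node) { n.cordoned = true }

// drain stands in for pod eviction; a real implementation would use the
// Kubernetes eviction API and respect PodDisruptionBudgets.
func drain(n *node) error {
	if !n.cordoned {
		return errors.New("node must be cordoned before draining")
	}
	return nil
}

// handleTerminationWait sketches the flow from the PR description: when a
// node enters the ASG Terminating:Wait state, cordon it, evict its pods,
// and return the result to pass as LifecycleActionResult to the ASG's
// CompleteLifecycleAction call.
func handleTerminationWait(n *node) string {
	cordon(n)
	if err := drain(n); err != nil {
		return "ABANDON"
	}
	return "CONTINUE"
}

func main() {
	n := &node{name: "ip-10-0-1-23"}
	fmt.Println(handleTerminationWait(n)) // CONTINUE
}
```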
Which issue(s) this PR fixes: Fixes #2574
Release note:
None.
@akash-gautam: This issue is currently awaiting triage.
If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @akash-gautam. Thanks for your PR.
I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Hi @akash-gautam, thanks for the PR.
A few considerations here: 1. What would be the benefit of using MachinePool Machines for graceful shutdowns, as CAPZ does? 2. Can this implementation be extended for use when spot instances are supported in MachinePools? 3. We already have an EventBridge implementation, but it is in the experimental stage. We could graduate it and reuse it here.
I think a proposal or at least an ADR would be good for this issue considering there could be multiple ways to achieve this.
cc @richardcase
@sedefsavas
1. What would be the benefit of using MachinePool Machines for graceful shutdowns, as CAPZ does? - Sorry, I didn't get this question; could you please explain?
2. Can this implementation be extended for use when spot instances are supported in MachinePools? - We can, but it might be a bit tricky, as spot instances won't follow the lifecycle states that on-demand instances follow in an ASG. We are creating a pre-deletion hook on the machine pool ASG, which will prevent the instance from being terminated for 15 minutes in the case of a scale-in event, but for a spot instance we will get only a 2-minute notice (even less if the interruption action is hibernate).
3. We already have an EventBridge implementation, but it is in the experimental stage. We could graduate it and reuse it here. - Our current implementation covers lifecycle events of EC2 instances managed by EC2, but for this we need lifecycle events of EC2 instances managed by an ASG. The lifecycle of EC2 instances differs depending on whether they are managed by EC2 or by an ASG. I also first tried using the instancestate package that we already had, but its event source is aws.ec2, which doesn't capture lifecycle states like Terminating:Wait; that is why I had to create a new package, quite similar to it, that captures events from aws.autoscaling.
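For illustration, an EventBridge rule that captures the ASG termination-wait transition would match on the aws.autoscaling source (rather than aws.ec2) with an event pattern along these lines; the ASG name shown is a hypothetical example:

```json
{
  "source": ["aws.autoscaling"],
  "detail-type": ["EC2 Instance-terminate Lifecycle Action"],
  "detail": {
    "AutoScalingGroupName": ["my-machinepool-asg"]
  }
}
```

The "EC2 Instance-terminate Lifecycle Action" event is emitted when an instance enters Terminating:Wait because a lifecycle hook is attached, which is exactly the state plain aws.ec2 instance state-change notifications never surface.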
This is a very nice feature btw 💯
@sedefsavas any news here?
How about graduating the EventBridge experimental feature and repurposing it to get event notifications for ASG scale-in as well? This will also be useful when we want to implement spot instance draining utilizing Spot Instance Termination Notice events.
@Ankitasw do you want to start a document or ADR for redesigning EventBridge to be flexible enough to reuse for instance status updates + ASG scale-in + spot instance termination notices? We can collaborate there, and once that work is done, the draining logic will just do the draining every time a relevant event is observed.
In the current EventBridge implementation, when an event is received, the relevant machine reconciliation is triggered. This was enough for that use case, because the machine controller can see the instance status during reconciliation, so there is no need to know the nature of the event. But for scale-in events there is probably no instance status change, hence the events need to be processed in the controller itself.
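The distinction above can be sketched as a routing step in the event consumer. This is a simplified illustration with hypothetical names, not CAPA code: aws.ec2 state-change events can simply trigger machine reconciliation (the controller reads the instance status itself), while aws.autoscaling terminate lifecycle actions carry no status change and must be acted on directly, e.g. by starting a node drain.

```go
package main

import "fmt"

// event is a trimmed-down EventBridge notification; real events carry a
// full detail payload.
type event struct {
	source     string
	detailType string
}

// route decides how the controller should react to an incoming event:
// plain EC2 status changes just re-trigger reconciliation, while ASG
// terminate lifecycle actions are handled in place (drain the node),
// since no instance status change accompanies them.
func route(e event) string {
	switch {
	case e.source == "aws.ec2":
		return "reconcile-machine"
	case e.source == "aws.autoscaling" && e.detailType == "EC2 Instance-terminate Lifecycle Action":
		return "drain-node"
	default:
		return "ignore"
	}
}

func main() {
	fmt.Println(route(event{"aws.ec2", "EC2 Instance State-change Notification"}))
	fmt.Println(route(event{"aws.autoscaling", "EC2 Instance-terminate Lifecycle Action"}))
}
```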
@sedefsavas yes, I am already working on a plan for graduating EventBridge to cover instance status updates, ASG scale-in, and spot instance termination notices, so those can feed the draining logic for MachinePools and spot instances.
How about graduating the EventBridge experimental feature and repurposing it to get event notifications for ASG scale-in as well? This will also be useful when we want to implement spot instance draining utilizing Spot Instance Termination Notice events.
Yes, I think that's a great idea @sedefsavas :+1:
@Ankitasw If there are any tasks/issues in the EventBridge graduation plan that I can take up, please let me know.
Created an issue for the EventBridge graduation: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/3414
/ok-to-test
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign cecilerobertmichon for approval by writing /assign @cecilerobertmichon in a comment. For more information see: The Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@akash-gautam: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-cluster-api-provider-aws-build-release-1-5 | ab9ae2a11e633550e6f668f5dbecaf81ca1ab106 | link | true | /test pull-cluster-api-provider-aws-build-release-1-5 |
| pull-cluster-api-provider-aws-test | 2e63b89874e08445f722e419eaa7d11836ea3e72 | link | true | /test pull-cluster-api-provider-aws-test |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
@akash-gautam apologies for replying late, but we have now planned a phased approach: enable this first in EventBridge, then a separate proposal for node draining. This is the ADR for graduating EventBridge to accommodate different types of events. I will let you know once we start that feature, so we can see if you could contribute to it.
For now, closing this PR to minimize duplicate effort; we can always revisit this PR if we want to reuse some bits. /close
@Ankitasw: Closed this PR.
In response to this:
@akash-gautam apologies for replying late, but we have now planned a phased approach: enable this first in EventBridge, then a separate proposal for node draining. This is the ADR for graduating EventBridge to accommodate different types of events. I will let you know once we start that feature, so we can see if you could contribute to it.
For now, closing this PR to minimize duplicate effort; we can always revisit this PR if we want to reuse some bits. /close
@Ankitasw Okay, thanks for the info.