cluster-api-provider-aws
cluster-api-provider-aws copied to clipboard
Allow garbage collector to delete ec2 instances
What type of PR is this?
What this PR does / why we need it:
/kind feature
Addition to the experimental gc by @richardcase, we should also see if we can scan the CAPA owned tags, to make sure we don't have any leftover. Thoughts?
Release note:
NONE
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from vincepri. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/retest
/test ?
@vincepri: The following commands are available to trigger required jobs:
/test pull-cluster-api-provider-aws-build/test pull-cluster-api-provider-aws-test/test pull-cluster-api-provider-aws-verify
The following commands are available to trigger optional jobs:
/test pull-cluster-api-provider-aws-apidiff-main/test pull-cluster-api-provider-aws-e2e/test pull-cluster-api-provider-aws-e2e-blocking/test pull-cluster-api-provider-aws-e2e-clusterclass/test pull-cluster-api-provider-aws-e2e-conformance/test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts/test pull-cluster-api-provider-aws-e2e-eks/test pull-cluster-api-provider-aws-e2e-eks-gc/test pull-cluster-api-provider-aws-e2e-eks-testing
Use /test all to run the following jobs that were automatically triggered:
pull-cluster-api-provider-aws-apidiff-mainpull-cluster-api-provider-aws-buildpull-cluster-api-provider-aws-testpull-cluster-api-provider-aws-verify
In response to this:
/test ?
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test pull-cluster-api-provider-aws-e2e-eks-gc
Actually seems these tests have been failing for quite some time https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-aws#pr-e2e-eks-gc-main
FWIW, I've tested the logic in a custom cluster by creating an ec2 instance outside of CAPA, and it deleted properly
@vincepri - i'm curious in what scenarios there would be an EC2 instance created that isn't managed by CAPA itself (i.e. via Machines/MachinePools)?
The original idea of the GC was for cleaning up resources that where created by the CCM as this can block CAPA deleting a cluster. So, things like an application being deployed that has a service of type load balancer.
@richardcase A few use cases I had in mind:
- This would allow to cleanup cluster leftovers; especially if we do expand the filters to include CAPA owned tags.
- Allows to better support OpenShift, which creates resources using an old version of Cluster API, that tags resources with the cloud provider.
We've certainly seen users on AWS in the past create EC2 instances by themselves and join them to OpenShift clusters when the tooling within Kube/OpenShift didn't support a feature that they needed. I expect we aren't the only people seeing users do that
Imagine before we supported say EFA networking, if a user wanted to use that, what would stop them building and adding their own EC2 instances to the workload cluster? I don't think anything would stop them, and we should expect that this has happened somewhere before, where we have feature gaps
/milestone v2.3.0
@vincepri: You must be a member of the kubernetes-sigs/cluster-api-provider-aws-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Provider AWS Maintainers and have them propose you as an additional delegate for this responsibility.
In response to this:
/milestone v2.3.0
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Imagine before we supported say EFA networking, if a user wanted to use that, what would stop them building and adding their own EC2 instances to the workload cluster?
If the user created the EC2 instances, I think we can argue that the user is responsible for deleting them.
However, I think we can argue that the purpose of the garbage collector is to unblock cluster deletion. If a user creates an EC2 instance, CAPA does not delete it in its ordinary reconciliation, and the instance blocks deletion of the subnet, VPC, etc, and therefore blocks cluster deletion.
Therefore, we could say that the garbage collector should be extended to clean up all AWS resources that would block cluster deletion. I think the garbage collector must be "best effort," because there are edge cases. For example, some AWS resources may be missing the correct tags, and removing others may require different AWS credentials than the ones the garbage collector has.
/milestone v2.3.0
@vincepri: You must be a member of the kubernetes-sigs/cluster-api-provider-aws-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Provider AWS Maintainers and have them propose you as an additional delegate for this responsibility.
In response to this:
/milestone v2.3.0
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@richardcase Is this good to go?
/test pull-cluster-api-provider-aws-e2e-eks-gc
Just testing its not a flake:
/test pull-cluster-api-provider-aws-e2e-eks-gc
(could be that #4575 will be needed...i get the failing test fixed on that PR)
/test pull-cluster-api-provider-aws-e2e-eks-gc
Circling back to this now i finally have some time. After doing another review i think we probably need:
- Unit tests covering this new deletion (thanks @AndiDog )
- An aletrnative collection function that will descibe/get the EC2 instances like we do for the other resource types.. See this. Some partitions do not allow the use of the resource tagging api and so a fallback (i.e. alternative) is required.
/milestone v2.4.0
@vincepri: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-cluster-api-provider-aws-build-docker | 7c2d0b524402810e973a57521e36c706809534f2 | link | true | /test pull-cluster-api-provider-aws-build-docker |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.