cluster-api-provider-aws icon indicating copy to clipboard operation
cluster-api-provider-aws copied to clipboard

Allow garbage collector to delete ec2 instances

Open vincepri opened this issue 2 years ago • 22 comments
trafficstars

What type of PR is this?

What this PR does / why we need it:

/kind feature

Addition to the experimental gc by @richardcase, we should also see if we can scan the CAPA owned tags, to make sure we don't have any leftover. Thoughts?

Release note:

NONE

vincepri avatar Oct 10 '23 23:10 vincepri

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from vincepri. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Oct 10 '23 23:10 k8s-ci-robot

/retest

vincepri avatar Oct 11 '23 02:10 vincepri

/test ?

vincepri avatar Oct 11 '23 03:10 vincepri

@vincepri: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-aws-build
  • /test pull-cluster-api-provider-aws-test
  • /test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-aws-apidiff-main
  • /test pull-cluster-api-provider-aws-e2e
  • /test pull-cluster-api-provider-aws-e2e-blocking
  • /test pull-cluster-api-provider-aws-e2e-clusterclass
  • /test pull-cluster-api-provider-aws-e2e-conformance
  • /test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-aws-e2e-eks
  • /test pull-cluster-api-provider-aws-e2e-eks-gc
  • /test pull-cluster-api-provider-aws-e2e-eks-testing

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-aws-apidiff-main
  • pull-cluster-api-provider-aws-build
  • pull-cluster-api-provider-aws-test
  • pull-cluster-api-provider-aws-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Oct 11 '23 03:10 k8s-ci-robot

/test pull-cluster-api-provider-aws-e2e-eks-gc

vincepri avatar Oct 11 '23 03:10 vincepri

Actually seems these tests have been failing for quite some time https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-aws#pr-e2e-eks-gc-main

vincepri avatar Oct 11 '23 13:10 vincepri

FWIW, I've tested the logic in a custom cluster by creating an ec2 instance outside of CAPA, and it deleted properly

vincepri avatar Oct 11 '23 13:10 vincepri

@vincepri - i'm curious in what scenarios there would be an EC2 instance created that isn't managed by CAPA itself (i.e. via Machines/MachinePools)?

The original idea of the GC was for cleaning up resources that where created by the CCM as this can block CAPA deleting a cluster. So, things like an application being deployed that has a service of type load balancer.

richardcase avatar Oct 11 '23 21:10 richardcase

@richardcase A few use cases I had in mind:

  • This would allow to cleanup cluster leftovers; especially if we do expand the filters to include CAPA owned tags.
  • Allows to better support OpenShift, which creates resources using an old version of Cluster API, that tags resources with the cloud provider.

vincepri avatar Oct 11 '23 21:10 vincepri

We've certainly seen users on AWS in the past create EC2 instances by themselves and join them to OpenShift clusters when the tooling within Kube/OpenShift didn't support a feature that they needed. I expect we aren't the only people seeing users do that

Imagine before we supported say EFA networking, if a user wanted to use that, what would stop them building and adding their own EC2 instances to the workload cluster? I don't think anything would stop them, and we should expect that this has happened somewhere before, where we have feature gaps

JoelSpeed avatar Oct 12 '23 16:10 JoelSpeed

/milestone v2.3.0

vincepri avatar Oct 16 '23 16:10 vincepri

@vincepri: You must be a member of the kubernetes-sigs/cluster-api-provider-aws-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Provider AWS Maintainers and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone v2.3.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Oct 16 '23 16:10 k8s-ci-robot

Imagine before we supported say EFA networking, if a user wanted to use that, what would stop them building and adding their own EC2 instances to the workload cluster?

If the user created the EC2 instances, I think we can argue that the user is responsible for deleting them.

However, I think we can argue that the purpose of the garbage collector is to unblock cluster deletion. If a user creates an EC2 instance, CAPA does not delete it in its ordinary reconciliation, and the instance blocks deletion of the subnet, VPC, etc, and therefore blocks cluster deletion.

Therefore, we could say that the garbage collector should be extended to clean up all AWS resources that would block cluster deletion. I think the garbage collector must be "best effort," because there are edge cases. For example, some AWS resources may be missing the correct tags, and removing others may require different AWS credentials than the ones the garbage collector has.

dlipovetsky avatar Oct 16 '23 17:10 dlipovetsky

/milestone v2.3.0

vincepri avatar Oct 17 '23 02:10 vincepri

@vincepri: You must be a member of the kubernetes-sigs/cluster-api-provider-aws-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Provider AWS Maintainers and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone v2.3.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Oct 17 '23 02:10 k8s-ci-robot

@richardcase Is this good to go?

vincepri avatar Oct 23 '23 17:10 vincepri

/test pull-cluster-api-provider-aws-e2e-eks-gc

vincepri avatar Oct 23 '23 18:10 vincepri

Just testing its not a flake:

/test pull-cluster-api-provider-aws-e2e-eks-gc

(could be that #4575 will be needed...i get the failing test fixed on that PR)

richardcase avatar Oct 26 '23 10:10 richardcase

/test pull-cluster-api-provider-aws-e2e-eks-gc

vincepri avatar Nov 13 '23 16:11 vincepri

Circling back to this now i finally have some time. After doing another review i think we probably need:

  • Unit tests covering this new deletion (thanks @AndiDog )
  • An aletrnative collection function that will descibe/get the EC2 instances like we do for the other resource types.. See this. Some partitions do not allow the use of the resource tagging api and so a fallback (i.e. alternative) is required.

richardcase avatar Nov 20 '23 08:11 richardcase

/milestone v2.4.0

richardcase avatar Jan 19 '24 08:01 richardcase

@vincepri: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-aws-build-docker 7c2d0b524402810e973a57521e36c706809534f2 link true /test pull-cluster-api-provider-aws-build-docker

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot avatar Feb 29 '24 12:02 k8s-ci-robot