cluster-api-provider-aws

Make the delete reconcile loop more robust to errors

Open erwinvaneyk opened this issue 6 years ago • 10 comments

/kind bug

What steps did you take and what happened:

  1. I created a capa cluster using an account that was missing a required permission (ELB).
  2. The controller provisioned parts of the cluster until it tried to deploy the ELB.
  3. I tried to delete the cluster.
  4. The controller then became stuck deleting the cluster, because it still lacked the ELB permission.
  5. Other cluster components (e.g. the VPC) remain deployed and cannot be deleted without manual intervention.

What did you expect to happen: Although this specific issue comes down to a misconfiguration on my part, it seems like the same problem would occur for any non-transient error during cluster deployment.

So, I would expect two things to happen (roughly sketched below):

  1. The controller should try to delete all components, regardless of whether some fail to be deleted.
  2. The controller should not fail trying to delete components that it did not create in the first place.
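A minimal sketch of what the first point could look like in the delete reconciler, assuming a hypothetical per-service `deleter` interface and apimachinery's error aggregation (the real capa service wiring differs):

```go
package controllers

import (
	"fmt"

	kerrors "k8s.io/apimachinery/pkg/util/errors"
)

// deleter is a hypothetical interface over the per-service cleanup steps
// (ELB, security groups, route tables, VPC, ...); the real capa services
// are named and wired differently.
type deleter interface {
	Name() string
	Delete() error
}

// reconcileDelete attempts every component and aggregates the failures
// instead of aborting on the first error, so a missing ELB permission
// does not leave the VPC and other components orphaned.
func reconcileDelete(services []deleter) error {
	var errs []error
	for _, svc := range services {
		if err := svc.Delete(); err != nil {
			// Record the failure and keep going with the remaining components.
			errs = append(errs, fmt.Errorf("deleting %s: %w", svc.Name(), err))
		}
	}
	// NewAggregate returns nil when errs is empty.
	return kerrors.NewAggregate(errs)
}
```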

Environment:

  • Cluster-api-provider-aws version: v0.4.3
  • Kubernetes version: (use kubectl version): v1.16.2

If this is an actual issue that is within the scope of capa, I would be happy to contribute a patch myself. 🙂

erwinvaneyk avatar Nov 07 '19 09:11 erwinvaneyk

I think it is probably okay to continue with deletion, skipping over resources that we do not have permissions to delete, assuming that we also attempt to describe the resource first.

It's probably a safe bet that if we lack permissions to describe or delete the resource, then we most likely lacked the permissions to create the resource and the chance of orphaning a resource would be slim to none.

This might get a bit tricky around some of the resources that we manage through transitive dependencies of other resources, so it might require some special handling on a case-by-case basis.
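A rough sketch of this describe-first, skip-on-access-denied idea, assuming aws-sdk-go's `awserr` package for error inspection; the error codes and helper names here are illustrative only, since the exact codes vary by AWS service:

```go
package services

import "github.com/aws/aws-sdk-go/aws/awserr"

// isAuthorizationError reports whether err looks like an AWS permissions
// failure. The codes differ per service (EC2 returns "UnauthorizedOperation",
// ELB and others return "AccessDenied"/"AccessDeniedException"), so this
// list is illustrative rather than exhaustive.
func isAuthorizationError(err error) bool {
	if aerr, ok := err.(awserr.Error); ok {
		switch aerr.Code() {
		case "AccessDenied", "AccessDeniedException", "UnauthorizedOperation":
			return true
		}
	}
	return false
}

// deleteIfOwned describes the resource before deleting it. If we cannot even
// describe it, we most likely never created it, so we skip it rather than
// block cluster deletion. describeFn and deleteFn stand in for the
// per-resource AWS API calls and are placeholders, not real capa functions.
func deleteIfOwned(describeFn, deleteFn func() error) error {
	if err := describeFn(); err != nil {
		if isAuthorizationError(err) {
			return nil // skip: the chance of orphaning is slim to none
		}
		return err
	}
	if err := deleteFn(); err != nil && !isAuthorizationError(err) {
		return err
	}
	return nil
}
```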

detiber avatar Nov 07 '19 15:11 detiber

@randomvariable please add some info on the dependency ordering of AWS components

ncdc avatar Dec 02 '19 18:12 ncdc

@randomvariable bump

joonas avatar Dec 20 '19 19:12 joonas

Trying to de-scope v0.5. Moved to Next.

ncdc avatar Jan 17 '20 18:01 ncdc

Definitely next. Quite a bit of refactoring to be done to make this happen.

randomvariable avatar Jan 29 '20 15:01 randomvariable

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Apr 28 '20 16:04 fejta-bot

/lifecycle frozen

detiber avatar Apr 28 '20 19:04 detiber

/remove-lifecycle frozen

richardcase avatar Jul 08 '22 22:07 richardcase

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 06 '22 23:10 k8s-triage-robot

/remove-lifecycle stale

richardcase avatar Oct 10 '22 10:10 richardcase

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 08 '23 17:02 k8s-triage-robot


The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Mar 10 '23 18:03 k8s-triage-robot