cluster-api-provider-aws

Make the delete reconcile loop more robust to errors

Open erwinvaneyk opened this issue 6 years ago • 10 comments

/kind bug

What steps did you take and what happened:

  1. I created a capa cluster using an account that was missing a required permission (ELB).
  2. The controller provisioned parts of the cluster until it tried to deploy the ELB.
  3. I tried to delete the cluster.
  4. The controller then became stuck deleting the cluster, because it still lacked the ELB permission.
  5. Other cluster components (e.g. the VPC) remain deployed and cannot be deleted without manual intervention.

What did you expect to happen: Although this specific issue comes down to a misconfiguration on my part, it seems like the same problem would occur for any non-transient error during cluster deployment.

So, I would expect two things to happen (roughly sketched below):

  1. The controller should try to delete all components, regardless of whether some fail to be deleted.
  2. The controller should not fail trying to delete components that it did not create in the first place.
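A minimal sketch of what the first point could look like in the delete reconciler, assuming a hypothetical per-service `deleter` interface and apimachinery's error aggregation (the real capa service wiring differs):

```go
package controllers

import (
	"fmt"

	kerrors "k8s.io/apimachinery/pkg/util/errors"
)

// deleter is a hypothetical interface over the per-service cleanup steps
// (ELB, security groups, route tables, VPC, ...); the real capa services
// are named and wired differently.
type deleter interface {
	Name() string
	Delete() error
}

// reconcileDelete attempts every component and aggregates the failures
// instead of aborting on the first error, so a missing ELB permission
// does not leave the VPC and other components orphaned.
func reconcileDelete(services []deleter) error {
	var errs []error
	for _, svc := range services {
		if err := svc.Delete(); err != nil {
			// Record the failure and keep going with the remaining components.
			errs = append(errs, fmt.Errorf("deleting %s: %w", svc.Name(), err))
		}
	}
	// NewAggregate returns nil when errs is empty.
	return kerrors.NewAggregate(errs)
}
```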

Environment:

  • Cluster-api-provider-aws version: v0.4.3
  • Kubernetes version: (use kubectl version): v1.16.2

If this is an actual issue that is within the scope of capa, I would be happy to contribute a patch myself. 🙂

erwinvaneyk avatar Nov 07 '19 09:11 erwinvaneyk

I think it is probably okay to continue with deletion, skipping over resources that we do not have permissions to delete, assuming that we also attempt to describe the resource first.

It's probably a safe bet that if we lack permissions to describe or delete the resource, then we most likely lacked the permissions to create the resource and the chance of orphaning a resource would be slim to none.

This might get a bit tricky around some of the resources that we manage through transitive dependencies of other resources, so it might require some special handling on a case-by-case basis.
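A rough sketch of this describe-first, skip-on-access-denied idea, assuming aws-sdk-go's `awserr` package for error inspection; the error codes and helper names here are illustrative only, since the exact codes vary by AWS service:

```go
package services

import "github.com/aws/aws-sdk-go/aws/awserr"

// isAuthorizationError reports whether err looks like an AWS permissions
// failure. The codes differ per service (EC2 returns "UnauthorizedOperation",
// ELB and others return "AccessDenied"/"AccessDeniedException"), so this
// list is illustrative rather than exhaustive.
func isAuthorizationError(err error) bool {
	if aerr, ok := err.(awserr.Error); ok {
		switch aerr.Code() {
		case "AccessDenied", "AccessDeniedException", "UnauthorizedOperation":
			return true
		}
	}
	return false
}

// deleteIfOwned describes the resource before deleting it. If we cannot even
// describe it, we most likely never created it, so we skip it rather than
// block cluster deletion. describeFn and deleteFn stand in for the
// per-resource AWS API calls and are placeholders, not real capa functions.
func deleteIfOwned(describeFn, deleteFn func() error) error {
	if err := describeFn(); err != nil {
		if isAuthorizationError(err) {
			return nil // skip: the chance of orphaning is slim to none
		}
		return err
	}
	if err := deleteFn(); err != nil && !isAuthorizationError(err) {
		return err
	}
	return nil
}
```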

detiber avatar Nov 07 '19 15:11 detiber

@randomvariable please add some info on the dependency ordering of AWS components

ncdc avatar Dec 02 '19 18:12 ncdc

@randomvariable bump

joonas avatar Dec 20 '19 19:12 joonas

Trying to de-scope v0.5. Moved to Next.

ncdc avatar Jan 17 '20 18:01 ncdc

Definitely next. Quite a bit of refactoring to be done to make this happen.

randomvariable avatar Jan 29 '20 15:01 randomvariable

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Apr 28 '20 16:04 fejta-bot

/lifecycle frozen

detiber avatar Apr 28 '20 19:04 detiber

/remove-lifecycle frozen

richardcase avatar Jul 08 '22 22:07 richardcase

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 06 '22 23:10 k8s-triage-robot

/remove-lifecycle stale

richardcase avatar Oct 10 '22 10:10 richardcase

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 08 '23 17:02 k8s-triage-robot


The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Mar 10 '23 18:03 k8s-triage-robot