cluster-api-provider-aws
Make the delete reconcile loop more robust to errors
/kind bug
What steps did you take and what happened:
- I created a CAPA cluster with an account that was missing a required permission (ELB).
- The controller provisioned parts of the cluster until it tried to deploy the ELB.
- I tried to delete the cluster.
- The controller then became stuck deleting the cluster, because it still lacked the ELB permission.
- Other cluster components (e.g. the VPC) remained deployed and could not be deleted without manual intervention.
What did you expect to happen: Although this specific case comes down to a misconfiguration on my part, it seems like the same problem would arise for any kind of non-transient error during cluster deployment.
So, I would expect two things to happen:
- The controller should attempt to delete all components, even if some of them fail to delete (see the sketch after this list).
- The controller should not fail while trying to delete components that it never created in the first place.
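A minimal sketch of the first point, assuming a set of per-component services is walked in dependency order. The `deleter` interface and service names here are hypothetical, not CAPA's real types; only `kerrors.NewAggregate` from apimachinery is a real API.

```go
// Hypothetical sketch: attempt every component's deletion and aggregate the
// failures instead of returning on the first error.
package controllers

import (
	kerrors "k8s.io/apimachinery/pkg/util/errors"
)

// deleter is a stand-in for the per-component services (ELB, security
// groups, route tables, subnets, gateways, VPC, ...) in dependency order.
type deleter interface {
	Name() string
	Delete() error
}

// reconcileDelete walks every service, records failures, and only reports
// an error after all of them have had a chance to clean up.
func reconcileDelete(services []deleter) error {
	var errs []error
	for _, svc := range services {
		if err := svc.Delete(); err != nil {
			// Keep going: a failure in one component (e.g. the ELB)
			// should not block deletion of the others.
			errs = append(errs, err)
		}
	}
	return kerrors.NewAggregate(errs)
}
```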
Environment:
- Cluster-api-provider-aws version: v0.4.3
- Kubernetes version: (use kubectl version): v1.16.2
If this is an actual issue within the scope of CAPA, I would be happy to contribute a patch myself. 🙂
I think it is probably okay to continue with deletion, skipping over resources that we do not have permissions to delete, assuming that we also attempt to describe the resource first.
It's a safe bet that if we lack permissions to describe or delete a resource, we most likely lacked the permissions to create it in the first place, so the chance of orphaning a resource is slim to none.
This might get a bit tricky around some of the resources that we manage through transitive dependencies of other resources, so it might require some special handling on a case by case basis.
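A rough sketch of that "describe first, skip on permission errors" idea, assuming aws-sdk-go v1's awserr package. The error-code list and the `deleteIfOwned` helper are illustrative assumptions, and AWS services are not consistent about which access-denied code they return.

```go
// Hypothetical sketch: if describing or deleting a resource fails with a
// permissions error, treat it as something we never created and move on.
package controllers

import (
	"errors"

	"github.com/aws/aws-sdk-go/aws/awserr"
)

// isAccessDenied reports whether err looks like an IAM permissions failure.
// The set of codes is an assumption; services vary (EC2 uses
// UnauthorizedOperation, others use AccessDenied variants).
func isAccessDenied(err error) bool {
	var aerr awserr.Error
	if !errors.As(err, &aerr) {
		return false
	}
	switch aerr.Code() {
	case "AccessDenied", "AccessDeniedException", "UnauthorizedOperation":
		return true
	}
	return false
}

// deleteIfOwned describes the resource first, and only fails the reconcile
// for errors that are not permission-related.
func deleteIfOwned(describeFn, deleteFn func() error) error {
	if err := describeFn(); err != nil {
		if isAccessDenied(err) {
			return nil // we almost certainly never created it; skip
		}
		return err
	}
	if err := deleteFn(); err != nil && !isAccessDenied(err) {
		return err
	}
	return nil
}
```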
@randomvariable please add some info on the dependency ordering of AWS components
@randomvariable bump
Trying to de-scope v0.5. Moved to Next.
Definitely next. Quite a bit of refactoring to be done to make this happen.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/lifecycle frozen
/remove-lifecycle frozen
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten