cluster-api
Can't delete an invalid cluster
If a Cluster becomes invalid during its lifetime, e.g. when webhook validation changes between Cluster API versions, the Cluster can no longer be deleted. This is because the delete flow attempts to patch the Cluster with updated conditions before deleting it.
Steps to recreate:
- With e.g. CAPI 1.2.4, create a Cluster with more than two CIDR ranges under Cluster.spec.ClusterNetwork.Pods
- Update the controller to the version used in #7420, which adds webhook validation to reject Clusters with more than two CIDR ranges specified
- Delete the Cluster
With this flow, all of the Cluster's descendant objects were deleted, but the Cluster itself stayed in 'deleting' indefinitely. It's probably sufficient to help users with this issue, which could be impactful for long-running clusters, by documenting how to resolve it and possibly by logging the error more clearly in the Cluster controller.
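For illustration, here's a rough Go sketch of the pattern described above. The package and function names are made up and this is not the actual controller code; it only mirrors the shape of the delete flow (conditions patched via the patch helper before the finalizer is removed) to show where a newly stricter webhook gets in the way:

```go
package sketch

import (
	"context"
	"fmt"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
	"sigs.k8s.io/cluster-api/util/patch"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// reconcileDeleteSketch mirrors the shape of the delete path: conditions are
// updated and the Cluster is patched before the finalizer removal is persisted.
func reconcileDeleteSketch(ctx context.Context, c client.Client, cluster *clusterv1.Cluster) error {
	patchHelper, err := patch.NewHelper(cluster, c)
	if err != nil {
		return err
	}

	// Record progress of the deletion on the Cluster's conditions.
	conditions.MarkFalse(cluster, clusterv1.ReadyCondition, "Deleting",
		clusterv1.ConditionSeverityInfo, "")

	// This update goes through the Cluster validating webhook. If the stored
	// spec is no longer valid under the current rules, the patch is rejected
	// here, the finalizer below is never removed, and the Cluster stays in
	// 'deleting' indefinitely even though all descendants are already gone.
	if err := patchHelper.Patch(ctx, cluster); err != nil {
		return fmt.Errorf("failed to patch Cluster: %w", err)
	}

	// Removing the finalizer is persisted through another update, which would
	// be blocked by the same webhook as well.
	controllerutil.RemoveFinalizer(cluster, clusterv1.ClusterFinalizer)
	return nil
}
```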
Found when implementing #7420
/kind bug
> This is because the delete flow attempts to patch the Cluster with updated conditions before deleting it.
I wonder if this can be optimized. I think if we only patch conditions we don't necessarily have to validate the whole Cluster.
Also wondering where exactly it's stuck (i.e. why we have to patch conditions at this point, but I guess even just removing the finalizer would be blocked today).
> I wonder if this can be optimized. I think if we only patch conditions we don't necessarily have to validate the whole Cluster.
I think this would be a good idea, but your second point is right that the finalizer is also a blocker (I didn't check the full contents of the patch, so I'm not sure). Is there a list of other operations we'd like to skip validation for? I can see an argument for setting paused as well: the cluster is invalid, and the user wants to pause it so that changes to make it valid aren't instantly reconciled.
Not sure about other things, but if there are only changes in status / finalizers / paused, I think it's reasonable to not re-validate everything else.
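As a loose sketch of that idea (hypothetical helper names, assuming the v1beta1 Cluster types; not the actual webhook code), the update validation could let an update through when nothing in the spec other than `paused` changed:

```go
package sketch

import (
	"reflect"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// validateClusterUpdateSketch skips full spec validation when an update only
// touches metadata (e.g. finalizers), status, or spec.paused, so the delete
// flow and pausing keep working even if the stored spec no longer passes
// newer validation rules.
func validateClusterUpdateSketch(oldRaw, newRaw runtime.Object) error {
	oldCluster, ok := oldRaw.(*clusterv1.Cluster)
	if !ok {
		return apierrors.NewBadRequest("expected a Cluster for the old object")
	}
	newCluster, ok := newRaw.(*clusterv1.Cluster)
	if !ok {
		return apierrors.NewBadRequest("expected a Cluster for the new object")
	}

	// Ignore spec.paused when comparing, so pausing an invalid Cluster is
	// also allowed through.
	oldSpec := oldCluster.Spec.DeepCopy()
	newSpec := newCluster.Spec.DeepCopy()
	oldSpec.Paused = false
	newSpec.Paused = false

	// Nothing meaningful in the spec changed: don't re-validate the rest.
	if reflect.DeepEqual(oldSpec, newSpec) {
		return nil
	}

	// The spec actually changed: run the full validation.
	return validateClusterSpecSketch(newCluster)
}

// validateClusterSpecSketch stands in for the real field-level checks.
func validateClusterSpecSketch(_ *clusterv1.Cluster) error {
	return nil
}
```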
/triage accepted
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/help
@killianmuldoon: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
/help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/priority important-soon
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with `/triage accepted` (org members only)
- Deprioritize it with `/priority important-longterm` or `/priority backlog`
- Close this issue with `/close`
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
We should consider how to address this, e.g. by adding specific tests when adding new validations, or by shifting to other strategies like "ratcheting" (validating only the things that actually changed).
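A rough sketch of what ratcheting could look like for the CIDR rule from #7420 (hypothetical helper names, assuming the v1beta1 types; not actual CAPI code): the stricter check only runs when the field it guards was actually modified, so pre-existing values keep working until the user touches that field.

```go
package sketch

import (
	"fmt"
	"reflect"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// validatePodsCIDRsRatcheted only enforces the stricter rule when the field
// it covers actually changed, so Clusters created before the rule existed can
// still be updated and deleted.
func validatePodsCIDRsRatcheted(oldCluster, newCluster *clusterv1.Cluster) error {
	oldPods := podsNetwork(oldCluster)
	newPods := podsNetwork(newCluster)

	// Ratcheting: the field is untouched, so an existing (now invalid) value
	// is tolerated.
	if reflect.DeepEqual(oldPods, newPods) {
		return nil
	}

	// The field was modified, so the new value must satisfy the rule.
	if newPods != nil && len(newPods.CIDRBlocks) > 2 {
		return fmt.Errorf("spec.clusterNetwork.pods.cidrBlocks: at most 2 CIDR blocks are allowed, got %d",
			len(newPods.CIDRBlocks))
	}
	return nil
}

func podsNetwork(c *clusterv1.Cluster) *clusterv1.NetworkRanges {
	if c == nil || c.Spec.ClusterNetwork == nil {
		return nil
	}
	return c.Spec.ClusterNetwork.Pods
}
```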
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with `/triage accepted` (org members only)
- Deprioritize it with `/priority important-longterm` or `/priority backlog`
- Close this issue with `/close`
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/triage accepted
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with `/triage accepted` (org members only)
- Deprioritize it with `/priority important-longterm` or `/priority backlog`
- Close this issue with `/close`
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/close
Due to lack of bandwidth.
Could be revisited once we have a concrete use-case in a more recent version.
@chrischdi: Closing this issue.
In response to this:
/close
Due to lack of bandwidth.
Could be revisited once we have a concrete use-case in a more recent version.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.