cluster-api
Can't delete an invalid cluster
If a Cluster becomes invalid during its lifetime, e.g. when webhook validation changes between Cluster API versions, the Cluster can no longer be deleted. This is because the delete flow attempts to patch the Cluster with updated conditions before deleting it.
Steps to recreate:
- With e.g. CAPI 1.2.4, create a Cluster with more than two CIDR ranges under Cluster.spec.ClusterNetwork.Pods
- Update the controller to the version used in #7420, which adds webhook validation to reject Clusters with more than two CIDR ranges specified
- Delete the Cluster
With this flow, all of the Cluster's descendant objects were deleted, but the Cluster itself stayed in 'deleting' indefinitely. It's probably sufficient to help users with this issue, which could be impactful for long-running clusters, by documenting how to resolve it and possibly by logging the error more clearly in the Cluster controller.
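For illustration, here's a rough Go sketch of the pattern described above. The package and function names are made up and this is not the actual controller code; it only mirrors the shape of the delete flow (conditions patched via the patch helper before the finalizer is removed) to show where a newly stricter webhook gets in the way:

```go
package sketch

import (
	"context"
	"fmt"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
	"sigs.k8s.io/cluster-api/util/patch"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// reconcileDeleteSketch mirrors the shape of the delete path: conditions are
// updated and the Cluster is patched before the finalizer removal is persisted.
func reconcileDeleteSketch(ctx context.Context, c client.Client, cluster *clusterv1.Cluster) error {
	patchHelper, err := patch.NewHelper(cluster, c)
	if err != nil {
		return err
	}

	// Record progress of the deletion on the Cluster's conditions.
	conditions.MarkFalse(cluster, clusterv1.ReadyCondition, "Deleting",
		clusterv1.ConditionSeverityInfo, "")

	// This update goes through the Cluster validating webhook. If the stored
	// spec is no longer valid under the current rules, the patch is rejected
	// here, the finalizer below is never removed, and the Cluster stays in
	// 'deleting' indefinitely even though all descendants are already gone.
	if err := patchHelper.Patch(ctx, cluster); err != nil {
		return fmt.Errorf("failed to patch Cluster: %w", err)
	}

	// Removing the finalizer is persisted through another update, which would
	// be blocked by the same webhook as well.
	controllerutil.RemoveFinalizer(cluster, clusterv1.ClusterFinalizer)
	return nil
}
```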
Found when implementing #7420
/kind bug
> This is because the delete flow attempts to patch the Cluster with updated conditions before deleting it.
I wonder if this can be optimized. I think if we only patch conditions we don't necessarily have to validate the whole Cluster.
Also wondering where exactly it's stuck (i.e. why we have to patch conditions at this point, but I guess even just removing the finalizer would be blocked today).
> I wonder if this can be optimized. I think if we only patch conditions we don't necessarily have to validate the whole Cluster.
I think this would be a good idea, but your second point is right that the finalizer is also a blocker (I didn't check the full contents of the patch, so I'm not sure). Is there a list of other operations we'd like to skip validation for? I can see an argument for setting paused as well: the cluster is invalid, and the user wants to pause it so that changes to make it valid aren't instantly reconciled.
Not sure about other things, but if there are only changes in status / finalizers / paused, I think it's reasonable to not re-validate everything else.
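As a loose sketch of that idea (hypothetical helper names, assuming the v1beta1 Cluster types; not the actual webhook code), the update validation could let an update through when nothing in the spec other than `paused` changed:

```go
package sketch

import (
	"reflect"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// validateClusterUpdateSketch skips full spec validation when an update only
// touches metadata (e.g. finalizers), status, or spec.paused, so the delete
// flow and pausing keep working even if the stored spec no longer passes
// newer validation rules.
func validateClusterUpdateSketch(oldRaw, newRaw runtime.Object) error {
	oldCluster, ok := oldRaw.(*clusterv1.Cluster)
	if !ok {
		return apierrors.NewBadRequest("expected a Cluster for the old object")
	}
	newCluster, ok := newRaw.(*clusterv1.Cluster)
	if !ok {
		return apierrors.NewBadRequest("expected a Cluster for the new object")
	}

	// Ignore spec.paused when comparing, so pausing an invalid Cluster is
	// also allowed through.
	oldSpec := oldCluster.Spec.DeepCopy()
	newSpec := newCluster.Spec.DeepCopy()
	oldSpec.Paused = false
	newSpec.Paused = false

	// Nothing meaningful in the spec changed: don't re-validate the rest.
	if reflect.DeepEqual(oldSpec, newSpec) {
		return nil
	}

	// The spec actually changed: run the full validation.
	return validateClusterSpecSketch(newCluster)
}

// validateClusterSpecSketch stands in for the real field-level checks.
func validateClusterSpecSketch(_ *clusterv1.Cluster) error {
	return nil
}
```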
/triage accepted
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/help
@killianmuldoon: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
/help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/priority important-soon
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with `/triage accepted` (org members only)
- Deprioritize it with `/priority important-longterm` or `/priority backlog`
- Close this issue with `/close`
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
We should consider how to address this, e.g. by adding specific tests when adding new validations, or by shifting to other strategies like "ratcheting" (validating only the things that actually changed).
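A rough sketch of what ratcheting could look like for the CIDR rule from #7420 (hypothetical helper names, assuming the v1beta1 types; not actual CAPI code): the stricter check only runs when the field it guards was actually modified, so pre-existing values keep working until the user touches that field.

```go
package sketch

import (
	"fmt"
	"reflect"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// validatePodsCIDRsRatcheted only enforces the stricter rule when the field
// it covers actually changed, so Clusters created before the rule existed can
// still be updated and deleted.
func validatePodsCIDRsRatcheted(oldCluster, newCluster *clusterv1.Cluster) error {
	oldPods := podsNetwork(oldCluster)
	newPods := podsNetwork(newCluster)

	// Ratcheting: the field is untouched, so an existing (now invalid) value
	// is tolerated.
	if reflect.DeepEqual(oldPods, newPods) {
		return nil
	}

	// The field was modified, so the new value must satisfy the rule.
	if newPods != nil && len(newPods.CIDRBlocks) > 2 {
		return fmt.Errorf("spec.clusterNetwork.pods.cidrBlocks: at most 2 CIDR blocks are allowed, got %d",
			len(newPods.CIDRBlocks))
	}
	return nil
}

func podsNetwork(c *clusterv1.Cluster) *clusterv1.NetworkRanges {
	if c == nil || c.Spec.ClusterNetwork == nil {
		return nil
	}
	return c.Spec.ClusterNetwork.Pods
}
```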
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with `/triage accepted` (org members only)
- Deprioritize it with `/priority important-longterm` or `/priority backlog`
- Close this issue with `/close`
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/triage accepted
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with `/triage accepted` (org members only)
- Deprioritize it with `/priority important-longterm` or `/priority backlog`
- Close this issue with `/close`
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/close
Due to lack of bandwidth.
Could be revisited once we have a concrete use-case in a more recent version.
@chrischdi: Closing this issue.
In response to this:
/close
Due to lack of bandwidth.
Could be revisited once we have a concrete use-case in a more recent version.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.