
🐛 Fix non control plane node deletion when the cluster has no control plane machines

Open damdo opened this pull request 1 year ago • 21 comments

/area machine

What this PR does / why we need it:

I'm working with a cluster where CAPI is installed on day-2 as a non-control-plane machine management system and hence the Cluster object spec.ControlPlaneRef is nil.

While issuing a deletion for a worker machine (managed by a CAPI MachineSet), I'm observing warnings like the following:

I1205 07:39:46.310195       1 machine_controller.go:359] "Skipping deletion of Kubernetes Node associated with Machine as it is not allowed" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="openshift-cluster-api/ibmpowervs-machineset-5jgxh" namespace="openshift-cluster-api" name="ibmpowervs-machineset-5jgxh" reconcileID="796cc502-4b6d-49dd-b51f-0c0e9afe99c7" MachineSet="openshift-cluster-api/ibmpowervs-machineset" Cluster="openshift-cluster-api/p-mad02-powervs-5-quo-np4rs" Node="ibmpowervs-machineset-5jgxh" cause="no control plane members"

This prevents the associated Kubernetes Node from being deleted, leaving an orphaned Node.

Upon investigation, the controller decides whether to delete the Node by counting the control plane machines in the cluster: if there are no active control plane machines, deletion is blocked. However, this logic does not consider whether the Node being deleted belongs to a control plane machine or a worker machine. Deleting a worker Node poses no risk, regardless of the control plane state.
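
For illustration, the current decision roughly amounts to the following (a simplified Go sketch with made-up names, not the actual machine_controller.go code):

```go
package sketch

import "errors"

// Simplified sketch of the current check (illustrative names only, not the
// real machine_controller.go code): Node deletion is skipped whenever CAPI
// sees zero non-deleting control plane machines, regardless of whether the
// machine being deleted is a worker or a control plane member.
func isDeleteNodeAllowedCurrent(nonDeletingControlPlaneMachines int) error {
	if nonDeletingControlPlaneMachines == 0 {
		// With a day-2 installation that only manages workers, this count
		// is always zero, so every worker Node deletion is skipped.
		return errors.New("no control plane members")
	}
	return nil
}
```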

This issue is critical for setups where the control plane isn’t managed by CAPI, as the control plane machine count will always be zero from CAPI’s perspective.

Proposed Solution: This PR modifies the logic to block node deletion only if the node belongs to a control plane machine and it is the last remaining control plane machine. Additionally, I adjusted the filtering to include all machines (not just non-deleting ones), providing a more accurate view of the cluster state, which aids both understanding and testing.
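
Here is a minimal sketch of the proposed check (simplified types and illustrative names, not the actual code in this PR):

```go
package main

import (
	"errors"
	"fmt"
)

// machineInfo is a simplified stand-in for clusterv1.Machine; only the
// fields needed for this sketch are modelled.
type machineInfo struct {
	name         string
	controlPlane bool
}

var errLastControlPlaneMember = errors.New("deleting the Node is not allowed: machine is the last control plane member")

// isDeleteNodeAllowedProposed blocks Node deletion only when the machine
// being deleted is itself a control plane machine and it is the last one
// remaining. Worker Node deletion is always allowed, even when CAPI manages
// no control plane machines at all (spec.ControlPlaneRef is nil).
func isDeleteNodeAllowedProposed(m machineInfo, allMachines []machineInfo) error {
	if !m.controlPlane {
		// Deleting a worker Node poses no risk regardless of control
		// plane state, so never block it.
		return nil
	}

	// Count every control plane machine; per the PR, the filtering no
	// longer excludes machines that are already being deleted, giving a
	// more accurate view of the cluster state (deletion timestamps are
	// not modelled in this sketch).
	controlPlaneCount := 0
	for _, other := range allMachines {
		if other.controlPlane {
			controlPlaneCount++
		}
	}
	if controlPlaneCount <= 1 {
		return errLastControlPlaneMember
	}
	return nil
}

func main() {
	worker := machineInfo{name: "ibmpowervs-machineset-5jgxh", controlPlane: false}
	// With no CAPI-managed control plane machines, the worker Node can
	// still be deleted under the proposed logic.
	fmt.Println(isDeleteNodeAllowedProposed(worker, []machineInfo{worker})) // prints <nil>
}
```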

damdo avatar Dec 10 '24 09:12 damdo

@damdo: The label(s) area/bug cannot be applied, because the repository doesn't have them.

In response to this:

/area bug

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Dec 10 '24 09:12 k8s-ci-robot

/cc @fabriziopandini @sbueringer

damdo avatar Dec 10 '24 09:12 damdo

/test help

chrischdi avatar Dec 10 '24 09:12 chrischdi

@chrischdi: The specified target(s) for /test were not found. The following commands are available to trigger required jobs:

  • /test pull-cluster-api-build-main
  • /test pull-cluster-api-e2e-blocking-main
  • /test pull-cluster-api-e2e-conformance-ci-latest-main
  • /test pull-cluster-api-e2e-conformance-main
  • /test pull-cluster-api-e2e-latestk8s-main
  • /test pull-cluster-api-e2e-main
  • /test pull-cluster-api-e2e-mink8s-main
  • /test pull-cluster-api-e2e-upgrade-1-31-1-32-main
  • /test pull-cluster-api-test-main
  • /test pull-cluster-api-test-mink8s-main
  • /test pull-cluster-api-verify-main

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-apidiff-main

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-apidiff-main
  • pull-cluster-api-build-main
  • pull-cluster-api-e2e-blocking-main
  • pull-cluster-api-test-main
  • pull-cluster-api-verify-main

In response to this:

/test help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Dec 10 '24 09:12 k8s-ci-robot

/test pull-cluster-api-e2e-conformance-ci-latest-main
/test pull-cluster-api-e2e-conformance-main
/test pull-cluster-api-e2e-latestk8s-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-mink8s-main
/test pull-cluster-api-e2e-upgrade-1-31-1-32-main

chrischdi avatar Dec 10 '24 09:12 chrischdi

It looks like CI is happy

damdo avatar Dec 10 '24 12:12 damdo

Who should I tag for a review, @chrischdi?

damdo avatar Dec 12 '24 11:12 damdo

/area machine

damdo avatar Dec 16 '24 14:12 damdo

/assign @enxebre

damdo avatar Dec 16 '24 14:12 damdo

Thanks Damiano, change lgtm pending Fabrizio's feedback

enxebre avatar Dec 16 '24 17:12 enxebre

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please ask for approval from enxebre. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

  • Approvers can indicate their approval by writing /approve in a comment
  • Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Dec 18 '24 11:12 k8s-ci-robot

Thanks @fabriziopandini and @enxebre for your review. I've addressed your feedback 👍

damdo avatar Dec 18 '24 11:12 damdo

/test pull-cluster-api-e2e-conformance-ci-latest-main
/test pull-cluster-api-e2e-conformance-main
/test pull-cluster-api-e2e-latestk8s-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-mink8s-main
/test pull-cluster-api-e2e-upgrade-1-31-1-32-main

damdo avatar Dec 18 '24 15:12 damdo

Stupid question, but does CAPI support setups where we neither have control plane Machines nor a control plane provider?

I thought the case where we don't have a control plane provider is just legacy from v1alpha2.

sbueringer avatar Jan 21 '25 16:01 sbueringer

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 21 '25 17:04 k8s-triage-robot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar May 15 '25 01:05 k8s-ci-robot

I still want to get back to this

/remove-lifecycle stale

damdo avatar May 15 '25 08:05 damdo

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 13 '25 08:08 k8s-triage-robot

/remove-lifecycle stale

damdo avatar Aug 13 '25 09:08 damdo

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 11 '25 10:11 k8s-triage-robot