🐛 Fix non-control-plane Node deletion when the cluster has no control plane machines
/area machine
What this PR does / why we need it:
I'm working with a cluster where CAPI is installed on day 2 as a non-control-plane machine management system, and hence the Cluster object's spec.ControlPlaneRef is nil.
When issuing a deletion for a worker Machine (managed by a CAPI MachineSet), I'm observing warnings like the following:
I1205 07:39:46.310195 1 machine_controller.go:359] "Skipping deletion of Kubernetes Node associated with Machine as it is not allowed" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="openshift-cluster-api/ibmpowervs-machineset-5jgxh" namespace="openshift-cluster-api" name="ibmpowervs-machineset-5jgxh" reconcileID="796cc502-4b6d-49dd-b51f-0c0e9afe99c7" MachineSet="openshift-cluster-api/ibmpowervs-machineset" Cluster="openshift-cluster-api/p-mad02-powervs-5-quo-np4rs" Node="ibmpowervs-machineset-5jgxh" cause="no control plane members"
This prevents the associated Kubernetes Node from being deleted, leaving an orphaned Node.
Upon investigation, the code decides whether to delete the Node by counting control plane machines in the cluster. If there are no active control plane machines, deletion is blocked. However, this logic fails to consider whether the node being deleted belongs to a control plane or worker machine. Deleting a worker node poses no risk, regardless of the control plane state.
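For illustration, the current gate boils down to roughly the sketch below. This is a hypothetical, simplified rendering of the behaviour described above, not the actual upstream code; the function and error names are invented for clarity.

```go
// Package sketch: a simplified, hypothetical rendering of the current gate.
package sketch

import "errors"

// errNoControlPlaneMembers mirrors the "no control plane members" cause seen
// in the log line above (illustrative name, not the upstream identifier).
var errNoControlPlaneMembers = errors.New("no control plane members")

// isDeleteNodeAllowedToday refuses Node deletion whenever the cluster has zero
// active control plane Machines, without checking whether the Machine being
// deleted is itself a control plane Machine or a worker. With a control plane
// that is not managed by CAPI, this count is always zero, so worker Nodes end
// up orphaned.
func isDeleteNodeAllowedToday(activeControlPlaneMachines int) error {
	if activeControlPlaneMachines == 0 {
		return errNoControlPlaneMembers
	}
	return nil
}
```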
This issue is critical for setups where the control plane isn’t managed by CAPI, as the control plane machine count will always be zero from CAPI’s perspective.
Proposed Solution: This PR modifies the logic to block node deletion only if the node belongs to a control plane machine and it is the last remaining control plane machine. Additionally, I adjusted the filtering to include all machines (not just non-deleting ones), providing a more accurate view of the cluster state, which aids both understanding and testing.
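Conceptually, the proposed check looks roughly like this (again a hypothetical sketch with invented names, just to show the shape of the change):

```go
// Package sketch: a simplified, hypothetical rendering of the proposed gate.
package sketch

import "errors"

var errLastControlPlaneMember = errors.New("last control plane member")

// isDeleteNodeAllowedProposed only blocks Node deletion when the Machine being
// deleted is itself a control plane Machine and it is the last remaining one.
// Worker Nodes can always be deleted, regardless of how many control plane
// Machines CAPI can see (zero, when the control plane is not CAPI-managed).
func isDeleteNodeAllowedProposed(machineIsControlPlane bool, controlPlaneMachines int) error {
	if machineIsControlPlane && controlPlaneMachines <= 1 {
		return errLastControlPlaneMember
	}
	return nil
}
```

The filtering adjustment mentioned above (counting all machines rather than only non-deleting ones) only changes how the control plane count feeding this check is computed; the sketch leaves that part out.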
@damdo: The label(s) area/bug cannot be applied, because the repository doesn't have them.
In response to this:
/area bug
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/cc @fabriziopandini @sbueringer
/test help
@chrischdi: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:
/test pull-cluster-api-build-main
/test pull-cluster-api-e2e-blocking-main
/test pull-cluster-api-e2e-conformance-ci-latest-main
/test pull-cluster-api-e2e-conformance-main
/test pull-cluster-api-e2e-latestk8s-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-mink8s-main
/test pull-cluster-api-e2e-upgrade-1-31-1-32-main
/test pull-cluster-api-test-main
/test pull-cluster-api-test-mink8s-main
/test pull-cluster-api-verify-main
The following commands are available to trigger optional jobs:
/test pull-cluster-api-apidiff-main
Use /test all to run the following jobs that were automatically triggered:
pull-cluster-api-apidiff-main
pull-cluster-api-build-main
pull-cluster-api-e2e-blocking-main
pull-cluster-api-test-main
pull-cluster-api-verify-main
In response to this:
/test help
/test pull-cluster-api-e2e-conformance-ci-latest-main
/test pull-cluster-api-e2e-conformance-main
/test pull-cluster-api-e2e-latestk8s-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-mink8s-main
/test pull-cluster-api-e2e-upgrade-1-31-1-32-main
It looks like CI is happy
Who should I tag for a review, @chrischdi?
/area machine
/assign @enxebre
Thanks Damiano, change lgtm pending Fabrizio feedback
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from enxebre. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Thanks @fabriziopandini and @enxebre for your review. I've addressed your feedback 👍
/test pull-cluster-api-e2e-conformance-ci-latest-main
/test pull-cluster-api-e2e-conformance-main
/test pull-cluster-api-e2e-latestk8s-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-mink8s-main
/test pull-cluster-api-e2e-upgrade-1-31-1-32-main
Stupid question, but does CAPI support setups where we neither have control plane Machines nor a control plane provider?
I thought the case that we don't have a control plane provider is just legacy from v1alpha2
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
PR needs rebase.
I still want to get back to this
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale