
🐛 Fix non control plane node deletion when the cluster has no control plane machines

Open damdo opened this pull request 1 year ago • 21 comments

/area machine

What this PR does / why we need it:

I'm working with a cluster where CAPI is installed on day-2 as a non-control-plane machine management system and hence the Cluster object spec.ControlPlaneRef is nil.

While issuing a deletion for a worker machine (managed by a CAPI MachineSet), I'm observing warnings like the following:

I1205 07:39:46.310195       1 machine_controller.go:359] "Skipping deletion of Kubernetes Node associated with Machine as it is not allowed" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="openshift-cluster-api/ibmpowervs-machineset-5jgxh" namespace="openshift-cluster-api" name="ibmpowervs-machineset-5jgxh" reconcileID="796cc502-4b6d-49dd-b51f-0c0e9afe99c7" MachineSet="openshift-cluster-api/ibmpowervs-machineset" Cluster="openshift-cluster-api/p-mad02-powervs-5-quo-np4rs" Node="ibmpowervs-machineset-5jgxh" cause="no control plane members"

This prevents the associated Kubernetes Node from being deleted, leaving an orphaned Node.

Upon investigation, the controller decides whether to delete the Node by counting the control plane machines in the cluster: if there are no active control plane machines, deletion is blocked. However, this logic does not consider whether the Node being deleted belongs to a control plane machine or a worker machine. Deleting a worker Node poses no risk, regardless of the control plane state.
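
For illustration, the current decision roughly amounts to the following (a simplified Go sketch with made-up names, not the actual machine_controller.go code):

```go
package sketch

import "errors"

// Simplified sketch of the current check (illustrative names only, not the
// real machine_controller.go code): Node deletion is skipped whenever CAPI
// sees zero non-deleting control plane machines, regardless of whether the
// machine being deleted is a worker or a control plane member.
func isDeleteNodeAllowedCurrent(nonDeletingControlPlaneMachines int) error {
	if nonDeletingControlPlaneMachines == 0 {
		// With a day-2 installation that only manages workers, this count
		// is always zero, so every worker Node deletion is skipped.
		return errors.New("no control plane members")
	}
	return nil
}
```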

This issue is critical for setups where the control plane isn’t managed by CAPI, as the control plane machine count will always be zero from CAPI’s perspective.

Proposed Solution: This PR modifies the logic to block node deletion only if the node belongs to a control plane machine and it is the last remaining control plane machine. Additionally, I adjusted the filtering to include all machines (not just non-deleting ones), providing a more accurate view of the cluster state, which aids both understanding and testing.
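
Here is a minimal sketch of the proposed check (simplified types and illustrative names, not the actual code in this PR):

```go
package main

import (
	"errors"
	"fmt"
)

// machineInfo is a simplified stand-in for clusterv1.Machine; only the
// fields needed for this sketch are modelled.
type machineInfo struct {
	name         string
	controlPlane bool
}

var errLastControlPlaneMember = errors.New("deleting the Node is not allowed: machine is the last control plane member")

// isDeleteNodeAllowedProposed blocks Node deletion only when the machine
// being deleted is itself a control plane machine and it is the last one
// remaining. Worker Node deletion is always allowed, even when CAPI manages
// no control plane machines at all (spec.ControlPlaneRef is nil).
func isDeleteNodeAllowedProposed(m machineInfo, allMachines []machineInfo) error {
	if !m.controlPlane {
		// Deleting a worker Node poses no risk regardless of control
		// plane state, so never block it.
		return nil
	}

	// Count every control plane machine; per the PR, the filtering no
	// longer excludes machines that are already being deleted, giving a
	// more accurate view of the cluster state (deletion timestamps are
	// not modelled in this sketch).
	controlPlaneCount := 0
	for _, other := range allMachines {
		if other.controlPlane {
			controlPlaneCount++
		}
	}
	if controlPlaneCount <= 1 {
		return errLastControlPlaneMember
	}
	return nil
}

func main() {
	worker := machineInfo{name: "ibmpowervs-machineset-5jgxh", controlPlane: false}
	// With no CAPI-managed control plane machines, the worker Node can
	// still be deleted under the proposed logic.
	fmt.Println(isDeleteNodeAllowedProposed(worker, []machineInfo{worker})) // prints <nil>
}
```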

damdo avatar Dec 10 '24 09:12 damdo

@damdo: The label(s) area/bug cannot be applied, because the repository doesn't have them.

In response to this:

/area bug

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Dec 10 '24 09:12 k8s-ci-robot

/cc @fabriziopandini @sbueringer

damdo avatar Dec 10 '24 09:12 damdo

/test help

chrischdi avatar Dec 10 '24 09:12 chrischdi

@chrischdi: The specified target(s) for /test were not found. The following commands are available to trigger required jobs:

  • /test pull-cluster-api-build-main
  • /test pull-cluster-api-e2e-blocking-main
  • /test pull-cluster-api-e2e-conformance-ci-latest-main
  • /test pull-cluster-api-e2e-conformance-main
  • /test pull-cluster-api-e2e-latestk8s-main
  • /test pull-cluster-api-e2e-main
  • /test pull-cluster-api-e2e-mink8s-main
  • /test pull-cluster-api-e2e-upgrade-1-31-1-32-main
  • /test pull-cluster-api-test-main
  • /test pull-cluster-api-test-mink8s-main
  • /test pull-cluster-api-verify-main

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-apidiff-main

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-apidiff-main
  • pull-cluster-api-build-main
  • pull-cluster-api-e2e-blocking-main
  • pull-cluster-api-test-main
  • pull-cluster-api-verify-main

In response to this:

/test help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Dec 10 '24 09:12 k8s-ci-robot

/test pull-cluster-api-e2e-conformance-ci-latest-main
/test pull-cluster-api-e2e-conformance-main
/test pull-cluster-api-e2e-latestk8s-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-mink8s-main
/test pull-cluster-api-e2e-upgrade-1-31-1-32-main

chrischdi avatar Dec 10 '24 09:12 chrischdi

It looks like CI is happy

damdo avatar Dec 10 '24 12:12 damdo

Who should I tag for a review, @chrischdi?

damdo avatar Dec 12 '24 11:12 damdo

/area machine

damdo avatar Dec 16 '24 14:12 damdo

/assign @enxebre

damdo avatar Dec 16 '24 14:12 damdo

Thanks Damiano, change lgtm pending Fabrizio's feedback

enxebre avatar Dec 16 '24 17:12 enxebre

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please ask for approval from enxebre. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

  • Approvers can indicate their approval by writing /approve in a comment
  • Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Dec 18 '24 11:12 k8s-ci-robot

Thanks @fabriziopandini and @enxebre for your review. I've addressed your feedback 👍

damdo avatar Dec 18 '24 11:12 damdo

/test pull-cluster-api-e2e-conformance-ci-latest-main
/test pull-cluster-api-e2e-conformance-main
/test pull-cluster-api-e2e-latestk8s-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-mink8s-main
/test pull-cluster-api-e2e-upgrade-1-31-1-32-main

damdo avatar Dec 18 '24 15:12 damdo

Stupid question, but does CAPI support setups where we neither have control plane Machines nor a control plane provider?

I thought the case where we don't have a control plane provider is just legacy from v1alpha2.

sbueringer avatar Jan 21 '25 16:01 sbueringer

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 21 '25 17:04 k8s-triage-robot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar May 15 '25 01:05 k8s-ci-robot

I still want to get back to this

/remove-lifecycle stale

damdo avatar May 15 '25 08:05 damdo

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 13 '25 08:08 k8s-triage-robot

/remove-lifecycle stale

damdo avatar Aug 13 '25 09:08 damdo

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 11 '25 10:11 k8s-triage-robot