✨ Introduce NodeDeletionStrategy to allow node drain when deleting a cluster

Open · lubronzhan opened this pull request 7 months ago · 8 comments

What this PR does / why we need it: Introduce a nodeDeletionStrategy field at the cluster level to allow opt-in node drain during cluster deletion. The default behavior stays the same as before: delete the machine without waiting for drain. Available options:

  • force
  • graceful

When implementing this, I noticed existing fields like NodeDrainTimeout at the MachinePoolTopology/ControlPlaneTopology level, so it might be more suitable to put nodeDeletionStrategy alongside NodeDrainTimeout at the same level. But if that's the case, does it make more sense to also put it at the MD/MS/Machine level, and implement the propagation logic as part of https://github.com/kubernetes-sigs/cluster-api/issues/10753? Open to thoughts.
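For reference, a rough sketch of what the proposed enum could look like, based only on the description above; the exact names, API version, and placement are assumptions, not the final API:

```go
// Rough sketch only: names, values, and placement are taken from the PR
// description and are assumptions, not the merged cluster-api API.
package v1beta2

// NodeDeletionStrategy controls whether nodes are drained when the owning Cluster is deleted.
type NodeDeletionStrategy string

const (
	// NodeDeletionStrategyForce keeps the current default: delete the Machine without waiting for drain.
	NodeDeletionStrategyForce NodeDeletionStrategy = "force"
	// NodeDeletionStrategyGraceful drains the node (bounded by NodeDrainTimeout) before deleting the Machine.
	NodeDeletionStrategyGraceful NodeDeletionStrategy = "graceful"
)

// ClusterSpecFragment shows one possible placement of the field at the cluster level.
type ClusterSpecFragment struct {
	// NodeDeletionStrategy opts the whole Cluster into draining nodes on deletion. Defaults to "force".
	NodeDeletionStrategy NodeDeletionStrategy `json:"nodeDeletionStrategy,omitempty"`
}
```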

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #9692 (https://github.com/kubernetes-sigs/cluster-api/issues/9692)

lubronzhan · May 02 '25 00:05

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign vincepri for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot · May 02 '25 00:05

Will fix the lint tomorrow.

lubronzhan · May 02 '25 04:05

/area machine

lubronzhan · May 02 '25 04:05

The fuzzy conversion related test is failing because this field exists in the v1beta2 API but not in the v1beta1 API. Should I also add this field to v1beta1?
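For context, the reason the round-trip (fuzzy) conversion test fails, and the two ways out of it, can be sketched like this; the types below are illustrative stand-ins, not the real cluster-api specs:

```go
package main

import "fmt"

// Illustrative stand-ins for the two API versions; only v1beta2 has the new field.
type ClusterSpecV1Beta1 struct {
	Paused bool
}

type ClusterSpecV1Beta2 struct {
	Paused               bool
	NodeDeletionStrategy string // new field, no v1beta1 counterpart
}

// Down-conversion necessarily drops the new field; round-trip fuzz tests fail
// unless either the field is also added to v1beta1, or the lost data is stashed
// somewhere (cluster-api typically preserves it via an annotation) and restored
// on up-conversion.
func downConvert(in ClusterSpecV1Beta2) (ClusterSpecV1Beta1, string) {
	return ClusterSpecV1Beta1{Paused: in.Paused}, in.NodeDeletionStrategy // second value: data to preserve
}

func upConvert(in ClusterSpecV1Beta1, preserved string) ClusterSpecV1Beta2 {
	return ClusterSpecV1Beta2{Paused: in.Paused, NodeDeletionStrategy: preserved}
}

func main() {
	orig := ClusterSpecV1Beta2{Paused: true, NodeDeletionStrategy: "graceful"}
	hub, saved := downConvert(orig)
	fmt.Println(upConvert(hub, saved) == orig) // true: round-trip is lossless only when the data is preserved
}
```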

lubronzhan · May 07 '25 06:05

@lubronzhan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-cluster-api-verify-main | 1b057297a5149a377feac0791c8b769f4b88d210 | link | true | /test pull-cluster-api-verify-main
pull-cluster-api-test-main | 1b057297a5149a377feac0791c8b769f4b88d210 | link | true | /test pull-cluster-api-test-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

k8s-ci-robot · May 07 '25 06:05

A general question about the expected behaviour.

As far as I understand, when you drain a workload with a corresponding PDB, the drain is blocked until a new pod for your workload comes up somewhere else.

How is this supposed to work when we are deleting the cluster, and more specifically when we reach a point where there is no place left for relocating workloads? Would this lead to drain (and thus cluster deletion) being stuck?

fabriziopandini · May 14 '25 14:05

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · May 16 '25 06:05

A general question about the expected behaviour.

As far as I understand, when you drain a workload with a corresponding PDB, the drain is blocked until a new pod for your workload comes up somewhere else.

How is this supposed to work when we are deleting the cluster, and more specifically when we reach a point where there is no place left for relocating workloads? Would this lead to drain (and thus cluster deletion) being stuck?

Thanks @fabriziopandini

Please correct me if I'm wrong. Looking at the code, previously (force delete of the node), isDeleteNodeAllowed would be false and draining the node would be skipped.

Now, with this change, isNodeDrainAllowed will be called. With a PDB blocking eviction, the drain will time out if NodeDrainTimeout is set to a non-zero value; after that we continue down the same code path as when isNodeDrainAllowed is false, i.e. skip draining and delete the machine.

So should I document that if users want to set NodeDeletionStrategy to graceful, they should also remember to set NodeDrainTimeout and NodeVolumeDetachTimeout?
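A minimal sketch of the control flow being described, simplified and with illustrative stand-in types; this is not the actual machine controller code:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative stand-ins for the relevant Machine spec fields; not the real cluster-api types.
type MachineSpec struct {
	NodeDeletionStrategy string        // hypothetical field from this PR: "force" or "graceful"
	NodeDrainTimeout     time.Duration // 0 means "wait forever"
}

// shouldSkipDrain sketches the decision described above: "force" skips drain entirely,
// while "graceful" drains but gives up once NodeDrainTimeout has elapsed, for example
// when a PDB blocks eviction and no other node can host the workload.
func shouldSkipDrain(spec MachineSpec, drainStartedAt time.Time) bool {
	if spec.NodeDeletionStrategy != "graceful" {
		return true // default behavior: delete the machine without draining
	}
	if spec.NodeDrainTimeout > 0 && time.Since(drainStartedAt) > spec.NodeDrainTimeout {
		return true // drain timed out (e.g. blocked by a PDB); proceed with machine deletion
	}
	return false
}

func main() {
	spec := MachineSpec{NodeDeletionStrategy: "graceful", NodeDrainTimeout: 10 * time.Minute}
	fmt.Println(shouldSkipDrain(spec, time.Now().Add(-15*time.Minute))) // true: timeout exceeded
}
```

This is also why, without a non-zero NodeDrainTimeout, a graceful strategy could indeed block cluster deletion indefinitely.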

lubronzhan · May 28 '25 18:05

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Aug 26 '25 18:08

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Sep 25 '25 19:09

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot · Oct 25 '25 19:10

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · Oct 25 '25 19:10