✨ Introduce NodeDeletionStrategy to allow node drain when deleting a cluster

Open · lubronzhan opened this pull request 7 months ago · 8 comments

What this PR does / why we need it: Introduce a nodeDeletionStrategy field at the cluster level to allow opt-in node drain during cluster deletion. The default behavior stays the same as before: delete the machine without waiting for drain. Available options:

  • force
  • graceful

When implementing this, I noticed existing fields like NodeDrainTimeout at the MachinePoolTopology/ControlPlaneTopology level, so it might be more suitable to put nodeDeletionStrategy alongside NodeDrainTimeout at the same level. But if that's the case, does it make more sense to also put it at the MD/MS/Machine level, and implement the propagation logic as part of https://github.com/kubernetes-sigs/cluster-api/issues/10753? Open to thoughts.
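For reference, a rough sketch of what the proposed enum could look like, based only on the description above; the exact names, API version, and placement are assumptions, not the final API:

```go
// Rough sketch only: names, values, and placement are taken from the PR
// description and are assumptions, not the merged cluster-api API.
package v1beta2

// NodeDeletionStrategy controls whether nodes are drained when the owning Cluster is deleted.
type NodeDeletionStrategy string

const (
	// NodeDeletionStrategyForce keeps the current default: delete the Machine without waiting for drain.
	NodeDeletionStrategyForce NodeDeletionStrategy = "force"
	// NodeDeletionStrategyGraceful drains the node (bounded by NodeDrainTimeout) before deleting the Machine.
	NodeDeletionStrategyGraceful NodeDeletionStrategy = "graceful"
)

// ClusterSpecFragment shows one possible placement of the field at the cluster level.
type ClusterSpecFragment struct {
	// NodeDeletionStrategy opts the whole Cluster into draining nodes on deletion. Defaults to "force".
	NodeDeletionStrategy NodeDeletionStrategy `json:"nodeDeletionStrategy,omitempty"`
}
```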

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #9692 (https://github.com/kubernetes-sigs/cluster-api/issues/9692)

lubronzhan · May 02 '25 00:05

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign vincepri for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot · May 02 '25 00:05

Will fix the lint tomorrow.

lubronzhan · May 02 '25 04:05

/area machine

lubronzhan · May 02 '25 04:05

The fuzzy conversion related test is failing because this field exists in the v1beta2 API but not in the v1beta1 API. Should I also add this field to v1beta1?
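For context, the reason the round-trip (fuzzy) conversion test fails, and the two ways out of it, can be sketched like this; the types below are illustrative stand-ins, not the real cluster-api specs:

```go
package main

import "fmt"

// Illustrative stand-ins for the two API versions; only v1beta2 has the new field.
type ClusterSpecV1Beta1 struct {
	Paused bool
}

type ClusterSpecV1Beta2 struct {
	Paused               bool
	NodeDeletionStrategy string // new field, no v1beta1 counterpart
}

// Down-conversion necessarily drops the new field; round-trip fuzz tests fail
// unless either the field is also added to v1beta1, or the lost data is stashed
// somewhere (cluster-api typically preserves it via an annotation) and restored
// on up-conversion.
func downConvert(in ClusterSpecV1Beta2) (ClusterSpecV1Beta1, string) {
	return ClusterSpecV1Beta1{Paused: in.Paused}, in.NodeDeletionStrategy // second value: data to preserve
}

func upConvert(in ClusterSpecV1Beta1, preserved string) ClusterSpecV1Beta2 {
	return ClusterSpecV1Beta2{Paused: in.Paused, NodeDeletionStrategy: preserved}
}

func main() {
	orig := ClusterSpecV1Beta2{Paused: true, NodeDeletionStrategy: "graceful"}
	hub, saved := downConvert(orig)
	fmt.Println(upConvert(hub, saved) == orig) // true: round-trip is lossless only when the data is preserved
}
```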

lubronzhan · May 07 '25 06:05

@lubronzhan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-cluster-api-verify-main | 1b057297a5149a377feac0791c8b769f4b88d210 | link | true | /test pull-cluster-api-verify-main
pull-cluster-api-test-main | 1b057297a5149a377feac0791c8b769f4b88d210 | link | true | /test pull-cluster-api-test-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

k8s-ci-robot · May 07 '25 06:05

A general question about the expected behaviour.

As far as I understand, when you drain a workload with a corresponding PDB, the drain is blocked until a new pod for your workload comes up somewhere else.

How is this supposed to work when we are deleting the cluster, and more specifically when we reach a point where there is no place left for relocating workloads? Would this lead to drain (and thus cluster deletion) being stuck?

fabriziopandini · May 14 '25 14:05

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · May 16 '25 06:05

A general question about the expected behaviour.

As far as I understand, when you drain a workload with a corresponding PDB, the drain is blocked until a new pod for your workload comes up somewhere else.

How is this supposed to work when we are deleting the cluster, and more specifically when we reach a point where there is no place left for relocating workloads? Would this lead to drain (and thus cluster deletion) being stuck?

Thanks @fabriziopandini

Please correct me if I'm wrong. Looking at the code, previously (force delete of the node), isDeleteNodeAllowed would be false and draining the node would be skipped.

Now, with this change, isNodeDrainAllowed will be called. With a PDB blocking eviction, the drain will time out if NodeDrainTimeout is set to a non-zero value; after that we continue down the same code path as when isNodeDrainAllowed is false, i.e. skip draining and delete the machine.

So should I document that if users want to set NodeDeletionStrategy to graceful, they should also remember to set NodeDrainTimeout and NodeVolumeDetachTimeout?
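A minimal sketch of the control flow being described, simplified and with illustrative stand-in types; this is not the actual machine controller code:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative stand-ins for the relevant Machine spec fields; not the real cluster-api types.
type MachineSpec struct {
	NodeDeletionStrategy string        // hypothetical field from this PR: "force" or "graceful"
	NodeDrainTimeout     time.Duration // 0 means "wait forever"
}

// shouldSkipDrain sketches the decision described above: "force" skips drain entirely,
// while "graceful" drains but gives up once NodeDrainTimeout has elapsed, for example
// when a PDB blocks eviction and no other node can host the workload.
func shouldSkipDrain(spec MachineSpec, drainStartedAt time.Time) bool {
	if spec.NodeDeletionStrategy != "graceful" {
		return true // default behavior: delete the machine without draining
	}
	if spec.NodeDrainTimeout > 0 && time.Since(drainStartedAt) > spec.NodeDrainTimeout {
		return true // drain timed out (e.g. blocked by a PDB); proceed with machine deletion
	}
	return false
}

func main() {
	spec := MachineSpec{NodeDeletionStrategy: "graceful", NodeDrainTimeout: 10 * time.Minute}
	fmt.Println(shouldSkipDrain(spec, time.Now().Add(-15*time.Minute))) // true: timeout exceeded
}
```

This is also why, without a non-zero NodeDrainTimeout, a graceful strategy could indeed block cluster deletion indefinitely.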

lubronzhan · May 28 '25 18:05

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Aug 26 '25 18:08

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Sep 25 '25 19:09

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot · Oct 25 '25 19:10

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · Oct 25 '25 19:10