cluster-api
✨ Introduce NodeDeletionStrategy to allow draining nodes when deleting a cluster
What this PR does / why we need it:
Introduce a nodeDeletionStrategy at the cluster level to allow opt-in node drain during cluster deletion. The default behavior stays the same as before: delete the machine without waiting for a drain.
Available options:
- force
- graceful
While implementing this, I noticed that existing fields like NodeDrainTimeout live at the MachinePoolTopology/ControlPlaneTopology level, so it might be more suitable to put nodeDeletionStrategy alongside NodeDrainTimeout at that level. But if that's the case, does it make more sense to also put it at the MD/MS/Machine level, and implement the propagation logic as part of https://github.com/kubernetes-sigs/cluster-api/issues/10753?
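For illustration, the opt-in field proposed above might look like the following Go sketch. The type, constant, and helper names here are assumptions based on this PR's description, not the merged cluster-api API:

```go
package main

import "fmt"

// NodeDeletionStrategy controls whether nodes are drained when the owning
// Cluster is being deleted. Names are a sketch, not the final API.
type NodeDeletionStrategy string

const (
	// "force" keeps today's default: delete the Machine without
	// waiting for the node to drain.
	NodeDeletionStrategyForce NodeDeletionStrategy = "force"
	// "graceful" drains the node before the Machine is deleted.
	NodeDeletionStrategyGraceful NodeDeletionStrategy = "graceful"
)

// shouldDrainOnClusterDeletion returns true only when the user has opted
// in to graceful deletion; an empty/unset value preserves the old behavior.
func shouldDrainOnClusterDeletion(s NodeDeletionStrategy) bool {
	return s == NodeDeletionStrategyGraceful
}

func main() {
	fmt.Println(shouldDrainOnClusterDeletion(""))                           // false: default unchanged
	fmt.Println(shouldDrainOnClusterDeletion(NodeDeletionStrategyForce))    // false
	fmt.Println(shouldDrainOnClusterDeletion(NodeDeletionStrategyGraceful)) // true
}
```

Because the empty value maps to the "force" path, existing Clusters that never set the field keep the current delete-without-drain behavior.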
Open to thoughts
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes https://github.com/kubernetes-sigs/cluster-api/issues/9692
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign vincepri for approval. For more information see the Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Will fix the lint tomorrow
/area machine
The fuzzy-conversion-related test is failing because this field exists in the v1beta2 API but not in the v1beta1 API. Should I also add this field to v1beta1?
@lubronzhan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-cluster-api-verify-main | 1b057297a5149a377feac0791c8b769f4b88d210 | link | true | /test pull-cluster-api-verify-main |
| pull-cluster-api-test-main | 1b057297a5149a377feac0791c8b769f4b88d210 | link | true | /test pull-cluster-api-test-main |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
A general question about the expected behaviour.
As far as I understand, when you drain a workload with a corresponding PDB, the drain is blocked until a new pod for your workload comes up somewhere else.
How is this supposed to work when we are deleting the cluster, and more specifically when we reach a point where there is no place left to relocate workloads? Would this lead to the drain (and thus cluster deletion) being stuck?
PR needs rebase.
Thanks @fabriziopandini
Please correct me if I'm wrong: looking at the code, previously, with force node deletion, isDeleteNodeAllowed would be false and node draining was skipped.
Now, if I add this, isNodeDrainAllowed will be called. With a PDB, the drain will time out if NodeDrainTimeout is set to a non-zero value, and then execution continues down the same code path as when isNodeDrainAllowed is false, so draining is skipped and the machine is deleted.
So should I mention that if users want to set NodeDeletionStrategy to graceful, they should also remember to set NodeDrainTimeout and NodeVolumeDetachTimeout?
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle rotten
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Reopen this PR with /reopen
- Mark this PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR.
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Reopen this PR with /reopen
- Mark this PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close