karpenter
Support Cascade Delete When Removing Karpenter from my Cluster
Description
What problem are you trying to solve?
I'd like to be able to configure cascading delete behavior for Karpenter so that I can set values on NodePool deletion or on CRD deletion that convey to Karpenter that I want a more expedited termination of my nodes rather than waiting for all nodes to fully drain.
Right now, nodes can hang on stuck pods or fully blocking PDBs because of our graceful drain logic. Since a NodePool deletion or CRD deletion causes all the nodes to gracefully drain, these deletion operations can also hang, halting the whole process. Ideally, a user could pass something like --grace-period when deleting a resource, and Karpenter could reason about how to propagate it down to every resource the deletion cascades to.
Minimally, we should allow CRD deletions to get unblocked so that cluster operators can uninstall Karpenter from clusters without being blocked by graceful node drains that may hang.
An initial implementation of this was attempted in https://github.com/kubernetes-sigs/karpenter/pull/466, and there was some discussion in the community about enabling the ability to pass gracePeriod through to CRs in the same way that you can pass them through to pods today to affect the deletionTimestamp for a CR, allowing controller authors to build custom logic around this gracePeriod concept.
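To make the ask concrete, the UX being described might look something like the sketch below. This is hypothetical only: the NodePool name is made up, and while kubectl already forwards --grace-period as part of the delete options, today only pods honor it server-side, so nothing here works against Karpenter's CRs yet.

```sh
# Hypothetical UX sketch -- not supported today. The idea is that a grace
# period supplied on a NodePool (or CRD) deletion would cascade down to the
# NodeClaims and nodes it owns, bounding how long graceful drain can block.
kubectl delete nodepool default --grace-period=300
```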
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
enabling the ability to pass gracePeriod through to CRs in the same way that you can pass them through to pods today to affect the deletionTimestamp for a CR
Building a coalition of supporters for this idea takes effort, but may pay off really well.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/reopen
@jonathan-innis: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/remove-lifecycle rotten
/triage accepted
Discussed this in WG today: the consensus was that folks generally still want graceful termination of their nodes -- they don't want Karpenter to always forcefully terminate all of the nodes on their behalf. There is currently a workaround with the TerminationGracePeriod implementation: start the teardown of Karpenter's CRDs so that the NodeClaims begin terminating, then have a user or automation annotate all of the nodes with karpenter.sh/nodeclaim-termination-timestamp to mark the time by which each NodeClaim has to be removed.
If you want forceful termination, you can set that timestamp to the current time; everything should then start forcefully removing itself, including tearing down the instances that Karpenter launched.
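For reference, a minimal sketch of that workaround with kubectl is below. The annotation key comes from the discussion above; the karpenter.sh/nodepool label selector, the RFC 3339 timestamp format, and whether the annotation belongs on the Node or the NodeClaim are assumptions to verify against the Karpenter version you are running.

```sh
# Sketch of the workaround described above -- verify the assumptions noted in
# the comments before relying on it.
# Step 1: start the teardown (e.g. delete the NodePools or Karpenter CRDs) so
# that NodeClaims begin graceful termination.
# Step 2: stamp the termination deadline to "now" so the drain is cut short and
# the underlying instances get reclaimed.
# Assumption: Karpenter-managed nodes carry the karpenter.sh/nodepool label and
# the annotation value is an RFC 3339 timestamp.
kubectl annotate nodes -l karpenter.sh/nodepool \
  karpenter.sh/nodeclaim-termination-timestamp="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --overwrite
```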
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/triage accepted