
Support Cascade Delete When Removing Karpenter from my Cluster

Open jonathan-innis opened this issue 1 year ago • 10 comments

Description

What problem are you trying to solve?

I'd like to be able to configure cascading delete behavior for Karpenter, so that on NodePool deletion or CRD deletion I can set values that tell Karpenter I want a more expedited termination of my nodes rather than waiting for all of them to fully drain.

Right now, nodes can hang during graceful drain due to stuck pods or fully blocking PDBs. Because a NodePool or CRD deletion causes all of the nodes to gracefully drain, these deletion operations can also hang, halting the whole process. Ideally, a user could pass something like --grace-period when deleting a resource, and Karpenter could reason about how to propagate it down to every resource the deletion cascades to.
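A rough sketch of what that could look like (hypothetical UX: kubectl already accepts --grace-period, but the API server only honors it for pods today, so propagating it through a NodePool deletion is exactly what this issue proposes; "default" is a placeholder NodePool name):

    # Hypothetical: delete a NodePool and give its nodes up to 5 minutes
    # to drain before Karpenter forcefully terminates them.
    kubectl delete nodepool default --grace-period=300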

Minimally, we should allow CRD deletions to get unblocked so that cluster operators can uninstall Karpenter from clusters without being blocked by graceful node drains that may hang.

An initial implementation of this was tried in https://github.com/kubernetes-sigs/karpenter/pull/466, and there was some discussion in the community about letting gracePeriod be passed through to CRs in the same way it can be passed to pods today, affecting the CR's deletionTimestamp and allowing controller authors to build custom logic around this gracePeriod concept.
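For reference, here is roughly how this works for pods today ("my-pod" is a placeholder name): the grace period is recorded on the object itself, so whoever is tearing the pod down knows the deadline. The idea above would record it the same way on CRs so that controllers like Karpenter could act on it.

    # Delete a pod with a 30s grace period, then inspect the deadline
    # the API server recorded on the object while it terminates.
    kubectl delete pod my-pod --grace-period=30 --wait=false
    kubectl get pod my-pod -o jsonpath='{.metadata.deletionTimestamp} {.metadata.deletionGracePeriodSeconds}'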

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

jonathan-innis avatar Feb 22 '24 07:02 jonathan-innis

enabling the ability to pass gracePeriod through to CRs in the same way that you can pass them through to pods today to affect the deletionTimestamp for a CR

Building a coalition of supporters for this idea takes effort, but it may pay off really well.

sftim avatar Feb 26 '24 17:02 sftim

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 26 '24 17:05 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jun 25 '24 17:06 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Jul 25 '24 18:07 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jul 25 '24 18:07 k8s-ci-robot

/reopen

jonathan-innis avatar Aug 01 '24 21:08 jonathan-innis

@jonathan-innis: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Aug 01 '24 21:08 k8s-ci-robot

/remove-lifecycle rotten

jonathan-innis avatar Aug 01 '24 21:08 jonathan-innis

/triage accepted

jonathan-innis avatar Aug 01 '24 21:08 jonathan-innis

Discussed this in WG today: the consensus was that folks, in general, still want graceful termination of their nodes -- they don't want Karpenter to always forcefully terminate every node on their behalf. There is currently a workaround via the TerminationGracePeriod implementation: start the teardown of Karpenter's CRDs, let the NodeClaims begin terminating, and then have a user or automation annotate all of the nodes with karpenter.sh/nodeclaim-termination-timestamp to mark the time by which each NodeClaim has to be removed.

In the case that you want forceful termination, you could set the timestamp to the current time; everything should then start forcefully removing itself, with the instances that were launched by Karpenter torn down as well.
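As a sketch of that workaround (this assumes the karpenter.sh/nodepool node label as a way to select Karpenter-managed nodes; adjust the selector to your setup):

    # Mark every Karpenter-managed node's NodeClaim termination deadline
    # as "now", so drains stop blocking and instances are torn down.
    kubectl annotate nodes -l karpenter.sh/nodepool \
      karpenter.sh/nodeclaim-termination-timestamp="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
      --overwrite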

jonathan-innis avatar Aug 01 '24 22:08 jonathan-innis

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot avatar Aug 01 '25 22:08 k8s-triage-robot

/triage accepted

rschalo avatar Aug 04 '25 22:08 rschalo