Karpenter termination leaks pods that tolerate the disruption taint, which Karpenter will not evict

Description

Observed Behavior: When terminating a Node/NodeClaim, Karpenter should terminate and evict all pods scheduled to the node, but some pods are leaked instead.

Karpenter skips eviction of pods that would reschedule onto the node (those that tolerate the disruption taint, such as daemonset pods), so those pods are not evicted and cleaned up before instance deletion. Once Karpenter successfully sends a delete request for the instance, which initiates kubelet shutdown, it immediately removes the finalizer, which deletes the node. This race can leave the kubelet unable to evict daemonset pods or to deregister itself. The affected pods are then leaked and rely on garbage collection to clean them up.
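To make the skip condition concrete, here is a rough Go sketch of how pods on a terminating node split into "evicted" and "skipped" based on whether they tolerate the disruption taint. This is not Karpenter's actual code: the `Taint`, `Toleration`, and `Pod` types are simplified stand-ins for the corev1 types, `partitionPods` is a hypothetical helper, and the taint key/value/effect in `disruptionTaint` are assumptions for illustration only.

```go
package main

import "fmt"

// Taint, Toleration, and Pod are simplified stand-ins for the corev1 types.
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key      string // empty key (with Operator "Exists") matches every taint key
	Operator string // "Exists" or "Equal"; empty is treated as "Equal"
	Value    string
	Effect   string // empty matches every effect
}

type Pod struct {
	Name        string
	Tolerations []Toleration
}

// disruptionTaint is an assumed taint; the real key, value, and effect
// depend on the Karpenter version in use.
var disruptionTaint = Taint{Key: "karpenter.sh/disruption", Value: "disrupting", Effect: "NoSchedule"}

// toleratesTaint reports whether a single toleration matches the taint.
func toleratesTaint(tol Toleration, t Taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return tol.Value == t.Value
	}
	return false
}

// partitionPods (hypothetical) splits pods into those that would be evicted
// and those that would be skipped because they tolerate the taint.
func partitionPods(pods []Pod, t Taint) (evict, skipped []Pod) {
	for _, p := range pods {
		tolerated := false
		for _, tol := range p.Tolerations {
			if toleratesTaint(tol, t) {
				tolerated = true
				break
			}
		}
		if tolerated {
			skipped = append(skipped, p)
		} else {
			evict = append(evict, p)
		}
	}
	return evict, skipped
}

func main() {
	pods := []Pod{
		{Name: "web"}, // no tolerations: evicted
		{Name: "log-agent", Tolerations: []Toleration{{Operator: "Exists"}}}, // tolerates everything: skipped
	}
	evict, skipped := partitionPods(pods, disruptionTaint)
	fmt.Println("evict:", len(evict), "skipped:", len(skipped))
}
```

The pods in the `skipped` slice are the ones at risk of leaking: nothing evicts them before the instance is deleted, so they depend on the kubelet winning the race described above.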

Expected Behavior: All pods are evicted and deleted from the node, and the kubelet is then able to shut down gracefully with no leaks.

This can be solved with #621 by adding a NoExecute taint to nodes, allowing TaintManagerEviction to delete pods. Additionally, Karpenter should wait to remove the node finalizer until the underlying cloud provider (CP) instance is Terminated (as opposed to Terminating).

Versions:

Chart Version:
Kubernetes Version (kubectl version):

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Nov 01 '23 19:11 njtran

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.

Dec 20 '23 12:12 github-actions[bot]

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Mar 25 '24 05:03 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Apr 24 '24 06:04 k8s-triage-robot

/assign @jmdeal

May 22 '24 20:05 billrayburn

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Jun 21 '24 21:06 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".


Jun 21 '24 21:06 k8s-ci-robot