Termination leaks pods that tolerate the disruption taint that Karpenter will not evict

Open njtran opened this issue 2 years ago • 4 comments

Description

Observed Behavior: When terminating a Node/NodeClaim, Karpenter should terminate and evict all pods scheduled to the node, but it does not.

Karpenter skips eviction of pods that would reschedule onto the node (pods that tolerate the disruption taint), so those pods are not evicted and cleaned up before the instance is deleted. Once Karpenter has successfully sent the delete request for the instance, which initiates kubelet shutdown, it immediately removes the finalizer, which deletes the Node. This race can leave the kubelet unable to evict DaemonSet pods or to deregister itself. Those pods are then leaked and rely on garbage collection for cleanup.
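To illustrate which pods fall into the skipped set, here is a minimal sketch of a toleration check. It is written in Python for brevity and is not Karpenter's code. The dataclasses are simplified stand-ins for the Kubernetes `Taint` and `Toleration` types, `pods_skipped_by_eviction` is a hypothetical helper, and the taint key, value, and effect shown are assumptions that may differ between Karpenter versions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with "Exists" tolerates every taint
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches any effect


@dataclass
class Pod:
    name: str
    tolerations: List[Toleration] = field(default_factory=list)


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Simplified version of the Kubernetes toleration matching rules."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    return toleration.value == taint.value  # "Equal" operator


def pods_skipped_by_eviction(pods: List[Pod], taint: Taint) -> List[str]:
    """Hypothetical helper: names of pods that tolerate the taint and are
    therefore not evicted before the instance is deleted."""
    return [p.name for p in pods if any(tolerates(t, taint) for t in p.tolerations)]


# Assumed disruption taint; key, value, and effect are illustrative.
DISRUPTION_TAINT = Taint(key="karpenter.sh/disruption", value="disrupting", effect="NoSchedule")

pods = [
    Pod("web"),                                      # no tolerations: evicted
    Pod("log-agent", [Toleration(operator="Exists")]),  # tolerates everything
    Pod("batch", [Toleration(key="karpenter.sh/disruption",
                             operator="Exists", effect="NoSchedule")]),
    Pod("gpu-job", [Toleration(key="nvidia.com/gpu", operator="Exists")]),
]
skipped = pods_skipped_by_eviction(pods, DISRUPTION_TAINT)
```

In this sketch, `log-agent` and `batch` are the pods left on the node when the instance delete request is sent; DaemonSet pods commonly carry a tolerate-everything toleration like the one on `log-agent`.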

Expected Behavior: All pods are evicted and deleted from the node, after which the kubelet can shut down gracefully with no leaked pods.

This can be solved with #621 by adding a NoExecute taint to nodes, allowing TaintManagerEviction to delete pods. Additionally, Karpenter should wait to remove the node finalizer until the underlying cloud provider (CP) instance is Terminated (as opposed to Terminating).
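The second half of the proposal, gating finalizer removal on the instance state, can be sketched as follows. This is an illustrative Python model, not Karpenter's implementation: `InstanceState`, `may_remove_finalizer`, and `reconcile` are hypothetical names, and the three states are a simplification of whatever lifecycle the cloud provider reports.

```python
from enum import Enum
from typing import Iterable


class InstanceState(Enum):
    RUNNING = "running"
    TERMINATING = "terminating"  # delete request accepted, kubelet may still be shutting down
    TERMINATED = "terminated"    # instance is gone


def may_remove_finalizer(state: InstanceState, wait_for_terminated: bool) -> bool:
    """Decide whether the node finalizer may be removed on this reconcile pass."""
    if wait_for_terminated:
        # Proposed behavior: only once the instance is fully terminated.
        return state is InstanceState.TERMINATED
    # Behavior described in the issue: as soon as the delete request succeeds.
    return state in (InstanceState.TERMINATING, InstanceState.TERMINATED)


def reconcile(observed: Iterable[InstanceState], wait_for_terminated: bool) -> int:
    """Return the index of the pass on which the finalizer is removed, or -1."""
    for i, state in enumerate(observed):
        if may_remove_finalizer(state, wait_for_terminated):
            return i
    return -1


# States seen on successive reconcile passes after the delete request is sent.
timeline = [InstanceState.TERMINATING, InstanceState.TERMINATING, InstanceState.TERMINATED]
racy_pass = reconcile(timeline, wait_for_terminated=False)
safe_pass = reconcile(timeline, wait_for_terminated=True)
```

With the gate enabled, the Node object stays in place for the passes during which the kubelet is still shutting down, which is the window in which it evicts remaining pods and deregisters itself.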

Versions:

  • Chart Version:
  • Kubernetes Version (kubectl version):

njtran, Nov 01 '23 19:11

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.

github-actions[bot], Dec 20 '23 12:12

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot, Mar 25 '24 05:03

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot, Apr 24 '24 06:04

/assign @jmdeal

billrayburn, May 22 '24 20:05

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot, Jun 21 '24 21:06

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot, Jun 21 '24 21:06