karpenter
"Do not evict" only for running pods
Tell us about your request
We configure all of our Helm hooks and other important pods with the do-not-evict annotation. Some pods might never reach the Running state because of errors like CreateContainerError, CreateContainerConfigError, or others.
Therefore we would like Karpenter to behave better here and only honor the do-not-evict annotation if a given pod is in a creating or running state.
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Right now a single pod can keep a node running for an unlimited time. After a while this becomes very costly; imagine having dozens of nodes in a left-behind state.
Are you currently working around this issue?
We manually kill these pods, which does not work if you have dozens of teams and projects running on a given cluster.
Additional Context
No response
Attachments
No response
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
I think in general it makes sense that, if the pod is in an errored state and is clearly not running, disruption like this should be allowed.
One thing I can think of is that this has the potential to race and put clusters in a half-state if they have jobs that are stateful:
- Pod errors and enters a Failed state
- do-not-evict is ignored and Karpenter begins to execute deprovisioning
- Pod starts running
- Node is deprovisioned, which leaves the job half-done
Any thoughts to add to this @njtran?
This race condition would only occur if a given pod is the only reason why the cluster scaled up in the first place.
My observation was that most times a pod kept a node running which was already set to no schedule because it was not needed anymore and the pod would also fit at some other node.
This race condition would only occur if a given pod is the only reason why the cluster scaled up in the first place
Not necessarily. Consolidation or expiration could occur on a node with do-not-evict pods, and there would still be other workloads that would be scheduled on that node.
I think the general guidance is that jobs should be disruptible and non-stateful, since there are any number of reasons why they could fail or get rescheduled.
In short, I agree that the race is most likely inconsequential.
There won't be other new workloads because once Karpenter decides to remove the node, it is marked unschedulable anyway.
So I believe that we already check if a pod is terminal (succeeded or failed) when considering the do-not-evict annotation.
https://github.com/aws/karpenter-core/blob/main/pkg/controllers/deprovisioning/helpers.go#L328-L334 https://github.com/aws/karpenter-core/blob/main/pkg/utils/pod/scheduling.go#L50-L52
Is there something I'm missing here? Are you not seeing this happening in your cluster? In the case of only considering running or creating pods, I'm not sure if we should ignore the do-not-evict annotation in the case of Unknown or Terminating, as those map to many different scenarios as well. If I understand correctly, a pod being terminal (succeeded or failed) should be sufficient.
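For reference, the kind of check those links point at could be sketched roughly like this (illustrative only, not the actual Karpenter source; the annotation key is shown here as an assumption):

```go
// Illustrative sketch: a pod that has already reached a terminal phase
// (Succeeded or Failed) should not block deprovisioning even if it carries
// the do-not-evict annotation.
package pod

import corev1 "k8s.io/api/core/v1"

// IsTerminal reports whether the pod has finished running for good.
func IsTerminal(p *corev1.Pod) bool {
	return p.Status.Phase == corev1.PodSucceeded || p.Status.Phase == corev1.PodFailed
}

// BlocksEviction returns true only when the annotation is set on a pod that
// is not terminal. The annotation key below is assumed, not authoritative.
func BlocksEviction(p *corev1.Pod) bool {
	if IsTerminal(p) {
		return false
	}
	_, ok := p.Annotations["karpenter.sh/do-not-evict"]
	return ok
}
```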
After poking around, I think I understand your request now. Looks like you want to bypass the annotation in the case that any of the containers have a problem that you deem a "terminal" state, resulting in the pod staying in the pending phase. If we wanted to try this, I think we'd have to check all the containers on the pod and allow a set of container statuses to bypass the annotation.
Specifically checking the Go struct used for this, it doesn't seem like there's a fixed set of enum values for the reason: https://pkg.go.dev/k8s.io/api/core/v1#ContainerStateWaiting. So "Waiting" could happen for any number of reasons that may or may not remediate themselves; it's tough to know.
I'd be happy to support this though, if we can construct a specific set of container status reasons that we can deem to satisfy @jonathan-innis's condition.
pod is in an errored state and is clearly not running
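As a rough illustration of what such a bypass could look like: check the container statuses of a Pending pod against a hand-picked set of waiting reasons. The reason set and helper below are assumptions for the sake of the sketch, not an agreed-upon design:

```go
// Hypothetical sketch: treat certain container waiting reasons as effectively
// terminal so the do-not-evict annotation can be bypassed. The reason set is
// an example only.
package pod

import corev1 "k8s.io/api/core/v1"

var terminalWaitingReasons = map[string]bool{
	"CreateContainerError":       true,
	"CreateContainerConfigError": true,
	"InvalidImageName":           true,
}

// StuckInTerminalWaiting returns true if the pod is still Pending and at
// least one container is waiting for a reason from the set above.
func StuckInTerminalWaiting(p *corev1.Pod) bool {
	if p.Status.Phase != corev1.PodPending {
		return false
	}
	for _, cs := range p.Status.ContainerStatuses {
		if w := cs.State.Waiting; w != nil && terminalWaitingReasons[w.Reason] {
			return true
		}
	}
	return false
}
```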
Yes, I would like to ignore pods which have containers that will never recover from a terminal state, but I would also like to simply ignore pods which are in these states:
- Failed
- Succeeded/Completed (this could already be implemented?)
Maybe for pods in a pending state we could ignore them after some "timeout"?
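A minimal sketch of what that optional timeout could look like, assuming the pod's creation timestamp as the reference point (purely hypothetical, not an accepted design):

```go
// Hypothetical, optional behaviour as proposed above: ignore do-not-evict on
// pods that have stayed Pending longer than a configurable timeout.
package pod

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// PendingLongerThan reports whether the pod has been Pending for more than d.
func PendingLongerThan(p *corev1.Pod, d time.Duration) bool {
	return p.Status.Phase == corev1.PodPending &&
		time.Since(p.CreationTimestamp.Time) > d
}
```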
Yes, I would like to ignore pods which have containers that will never recover from a terminal state, but I would also like to simply ignore pods which are in these states:
- Failed
- Succeeded/Completed (this could already be implemented?)
From the link that @njtran shared, we are considering both of these states when evaluating whether we should honor the do-not-evict annotation. The best guess that I have for what you are hitting here is pods that are assigned to a node but are stuck in a Pending state due to some issue that is causing the container to not actually get created and move into a running state.
CreateContainerError, CreateContainerConfigError
Can you check the pod phase when you are seeing these errors on your pods? My assumption is that these pods are moving into a pending/running state and therefore they are considered for the do-not-evict annotation.
Maybe for pods in a pending state we could ignore them after some "timeout"?
This seems like a dangerous position for Karpenter to take. We'd essentially be giving a TTL on pod startup time, which isn't a one-size-fits-all value.
Yes, I would like to ignore pods which have containers that will never recover from a terminal state, but I would also like to simply ignore pods which are in these states:
- Failed
- Succeeded/Completed (this could already be implemented?)
From the link that @njtran shared, we are considering both of these states when evaluating whether we should honor the do-not-evict annotation. The best guess that I have for what you are hitting here is pods that are assigned to a node but are stuck in a Pending state due to some issue that is causing the container to not actually get created and move into a running state.
That's also my assumption
CreateContainerError, CreateContainerConfigError
Can you check the pod phase when you are seeing these errors on your pods? My assumption is that these pods are moving into a pending/running state and therefore they are considered for the do-not-evict annotation.
The pod is in a Pending state and the container is in a Waiting state; here is a similar example: https://www.orchome.com/1938
Maybe for pods in a pending state we could ignore them after some "timeout"?
This seems like a dangerous position for Karpenter to take. We'd essentially be giving a TTL on pod startup time, which isn't a one-size-fits-all value.
Okay, maybe this could be an optional feature
The pod is in pending state and the container is in waiting state, here is a similar example
Is it possible to address this at the source and remediate the containers that are hanging in this state so that you don't get hung nodes? Is it possible to create some sort of alerting around how long a Pod sits in the Pending phase, so that if one sits there for too long you get an alarm and can go to the dev team and ask them to remediate their image/container?
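As an illustration of the alerting idea, a small helper (hypothetical, not part of Karpenter) that lists pods stuck in Pending past a threshold using controller-runtime's client:

```go
// Hypothetical helper to back the alerting suggestion above: list pods that
// have sat in Pending longer than a threshold so the owning teams can be
// notified.
package alerting

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// LongPendingPods returns pods that have been Pending for longer than threshold.
func LongPendingPods(ctx context.Context, c client.Client, threshold time.Duration) ([]corev1.Pod, error) {
	var pods corev1.PodList
	if err := c.List(ctx, &pods); err != nil {
		return nil, err
	}
	var stuck []corev1.Pod
	for _, p := range pods.Items {
		if p.Status.Phase == corev1.PodPending && time.Since(p.CreationTimestamp.Time) > threshold {
			stuck = append(stuck, p)
		}
	}
	return stuck, nil
}
```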
We have alerting in place, but if you have a lot of teams using the cluster it is a pain to manually take care of these issues.
Hello folks, I sent out this PR to fix this; please check and help review: https://github.com/aws/karpenter-core/pull/778
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen I do not think that this is solved
/reopen
@runningman84: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.