
"Do not evict" only for running pods

runningman84 opened this issue 1 year ago · 19 comments

Tell us about your request

We configure all of our Helm hooks and other important pods with the do-not-evict annotation. Some pods never reach the Running state because of errors like CreateContainerError, CreateContainerConfigError, or others.

Therefore we would like Karpenter to behave better: only consider the do-not-evict annotation if a given pod is in a creating or running state.
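
A minimal sketch of the requested behavior, assuming the `karpenter.sh/do-not-evict` annotation from the v1alpha5 API and the two error reasons named above (this is not Karpenter's actual code):

```go
package evictcheck

import corev1 "k8s.io/api/core/v1"

// doNotEvictAnnotation is the annotation name from the v1alpha5 API era.
const doNotEvictAnnotation = "karpenter.sh/do-not-evict"

// BlocksEviction returns true only when the pod carries the annotation AND is
// either running or still starting up normally. Pods stuck in error states
// such as CreateContainerError would no longer pin the node.
func BlocksEviction(pod *corev1.Pod) bool {
	if pod.Annotations[doNotEvictAnnotation] != "true" {
		return false
	}
	switch pod.Status.Phase {
	case corev1.PodRunning:
		return true
	case corev1.PodPending:
		for _, cs := range pod.Status.ContainerStatuses {
			if w := cs.State.Waiting; w != nil &&
				(w.Reason == "CreateContainerError" || w.Reason == "CreateContainerConfigError") {
				return false
			}
		}
		return true
	default:
		// Succeeded, Failed, Unknown: do not block.
		return false
	}
}
```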

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Right now a single pod can keep a node running indefinitely. Over time this becomes very costly... Imagine having dozens of nodes left behind in this state.

Are you currently working around this issue?

We manually kill these pods, which does not scale if you have dozens of teams and projects running on a given cluster...

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

runningman84 avatar Mar 06 '23 12:03 runningman84

I think it generally makes sense that if the pod is in an errored state and is clearly not running, disruption like this should be allowed.

One thing I can think of is that this has the potential to race and leave clusters in a half-finished state if they run stateful jobs:

  1. Pod errors and enters a Failed state
  2. do-not-evict is ignored and Karpenter begins to execute deprovisioning
  3. Pod starts running
  4. Node is deprovisioned which leaves the job half-done

Any thoughts to add to this @njtran?

jonathan-innis avatar Mar 06 '23 18:03 jonathan-innis

This race condition would only occur if a given pod were the only reason the cluster scaled up in the first place.

My observation was that most of the time a pod kept a node running that had already been marked NoSchedule because it was no longer needed, and the pod would also have fit on some other node.

runningman84 avatar Mar 07 '23 16:03 runningman84

This race condition would only occur if a given pod were the only reason the cluster scaled up in the first place

Not necessarily. Consolidation or expiration could occur on a node with do-not-evict pods while other workloads are still scheduled on that node.

I think the general guidance is that jobs should be tolerant of disruption and non-stateful, since there are any number of reasons why they could fail or get rescheduled.

In short, I agree that the race is most likely inconsequential.

jonathan-innis avatar Mar 07 '23 19:03 jonathan-innis

There won’t be other new workloads, because once Karpenter decides to remove the node it is marked NoSchedule anyway.

runningman84 avatar Mar 07 '23 19:03 runningman84

So I believe that we already check whether a pod is terminal (succeeded or failed) when honoring the do-not-evict annotation.

https://github.com/aws/karpenter-core/blob/main/pkg/controllers/deprovisioning/helpers.go#L328-L334
https://github.com/aws/karpenter-core/blob/main/pkg/utils/pod/scheduling.go#L50-L52

Is there something I'm missing here? Are you not seeing this happening in your cluster? In the case of only considering running or creating pods, I'm not sure we should ignore the do-not-evict annotation in the case of Unknown or Terminating, as those map to many different scenarios as well. If I understand correctly, a pod being terminal (succeeded or failed) should be sufficient.
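
For readers following along, the linked check amounts to roughly the following (a paraphrase, not the exact source):

```go
package pod

import corev1 "k8s.io/api/core/v1"

// IsTerminal reports whether the pod has finished, successfully or not.
// Terminal pods are not considered when honoring the do-not-evict annotation.
func IsTerminal(pod *corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed
}
```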

njtran avatar Mar 07 '23 20:03 njtran

After poking around, I think I understand your request now. Looks like you want to bypass the annotation in the case that any of the containers have a problem that you deem a "terminal" state, resulting in the pod staying in the pending phase. If we wanted to try this, I think we'd have to check all the containers on the pod and allow a set of container statuses to bypass the annotation.

Checking the Go struct used for this, it doesn't seem like there's a fixed set of enum values: https://pkg.go.dev/k8s.io/api/core/v1#ContainerStateWaiting. This means "Waiting" could happen for any number of reasons that may or may not remediate themselves; it's tough to know.

I'd be happy to support this though, if we can construct a specific set of container status reasons that we can deem to satisfy @jonathan-innis's condition.

pod is in an errored state and is clearly not running
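
As one illustration of what such an allowlist could look like, here is a sketch with an assumed set of waiting reasons (the concrete set would need to be agreed on, since `ContainerStateWaiting.Reason` is free-form text):

```go
package pod

import corev1 "k8s.io/api/core/v1"

// stuckWaitingReasons is an assumed allowlist of reasons treated as
// "errored and clearly not running"; it is not a vetted or exhaustive set.
var stuckWaitingReasons = map[string]bool{
	"CreateContainerError":       true,
	"CreateContainerConfigError": true,
	"InvalidImageName":           true,
}

// HasStuckContainer reports whether any container on the pod is waiting with
// one of the allowlisted reasons.
func HasStuckContainer(pod *corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if w := cs.State.Waiting; w != nil && stuckWaitingReasons[w.Reason] {
			return true
		}
	}
	return false
}
```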

njtran avatar Mar 07 '23 21:03 njtran

Yes, I would like to ignore pods whose containers will never recover from a terminal state, but I would also like to simply ignore pods which are in these states:

  • Failed
  • Succeeded/Completed (this could already be implemented?)

Maybe for pods in a pending state we could ignore them after some "timeout"?
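
Purely as an illustration of that optional timeout idea, a sketch follows; the timeout parameter is hypothetical and not an existing Karpenter setting:

```go
package pod

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// PendingTooLong reports whether a pod has sat in the Pending phase longer
// than an operator-chosen timeout. The timeout would have to be a
// configurable, opt-in value.
func PendingTooLong(pod *corev1.Pod, timeout time.Duration) bool {
	return pod.Status.Phase == corev1.PodPending &&
		time.Since(pod.CreationTimestamp.Time) > timeout
}
```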

runningman84 avatar Mar 08 '23 14:03 runningman84

Yes, I would like to ignore pods whose containers will never recover from a terminal state, but I would also like to simply ignore pods which are in these states:

  • Failed
  • Succeeded/Completed (this could already be implemented?)

From the link that @njtran shared, we already account for both of these states when evaluating whether to honor the do-not-evict annotation. My best guess for what you are hitting here is pods that are assigned to a node but are stuck in a pending state due to some issue that is preventing the container from actually being created and moving into a running state.

CreateContainerError, CreateContainerConfigError

Can you check the pod phase when you are seeing these errors on your pods? My assumption is that these pods are moving into a pending/running state and therefore they are considered for the do-not-evict annotation.

Maybe for pods in a pending state we could ignore them after some "timeout"?

This seems like a dangerous position for Karpenter to take. We'd essentially be giving a TTL on pod startup time, which isn't a one-size-fits-all value.

jonathan-innis avatar Mar 13 '23 17:03 jonathan-innis

Yes, I would like to ignore pods whose containers will never recover from a terminal state, but I would also like to simply ignore pods which are in these states:

  • Failed
  • Succeeded/Completed (this could already be implemented?)

From the link that @njtran shared, we already account for both of these states when evaluating whether to honor the do-not-evict annotation. My best guess for what you are hitting here is pods that are assigned to a node but are stuck in a pending state due to some issue that is preventing the container from actually being created and moving into a running state.

That's also my assumption

CreateContainerError, CreateContainerConfigError

Can you check the pod phase when you are seeing these errors on your pods? My assumption is that these pods are moving into a pending/running state and therefore they are considered for the do-not-evict annotation.

The pod is in the Pending state and the container is in a Waiting state; here is a similar example: https://www.orchome.com/1938

Maybe for pods in a pending state we could ignore them after some "timeout"?

This seems like a dangerous position for Karpenter to take. We'd essentially be giving a TTL on pod startup time, which isn't a one-size-fits-all value.

Okay, maybe this could be an optional feature

runningman84 avatar Apr 19 '23 08:04 runningman84

The pod is in the Pending state and the container is in a Waiting state; here is a similar example

Is it possible to get this at the source and remediate the containers that are hanging in this state so that you don't get hung nodes? Is it possible to create some sort of alerting around how long a Pod sits in a Pending phase so that if one sits there for too long, you get an alarm and then can go to the dev team to ask them to remediate their image/container?
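
A sketch of what such out-of-band detection could look like with client-go; the kubeconfig path and the 30-minute cutoff are assumptions:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a local kubeconfig; in-cluster config would work the same way.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	const maxPendingAge = 30 * time.Minute // assumed cutoff

	// List Pending pods across all namespaces and report the ones that have
	// been stuck longer than the cutoff.
	pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "status.phase=Pending",
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		if age := time.Since(p.CreationTimestamp.Time); age > maxPendingAge {
			fmt.Printf("%s/%s has been Pending for %s\n", p.Namespace, p.Name, age.Round(time.Minute))
		}
	}
}
```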

jonathan-innis avatar Apr 19 '23 23:04 jonathan-innis

We have alerting in place, but if you have a lot of teams using the cluster it is a pain to manually take care of these issues.

runningman84 avatar Apr 20 '23 09:04 runningman84

Hello folks, I sent out this PR to fix this; please check it out and help review: https://github.com/aws/karpenter-core/pull/778

mallow111 avatar Nov 11 '23 00:11 mallow111

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 09 '24 01:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Mar 10 '24 01:03 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Apr 09 '24 02:04 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 09 '24 02:04 k8s-ci-robot

/reopen I do not think that this is solved

runningman84 avatar Apr 09 '24 09:04 runningman84

/reopen

runningman84 avatar Apr 11 '24 13:04 runningman84

@runningman84: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 11 '24 13:04 k8s-ci-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar May 11 '24 13:05 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar May 11 '24 13:05 k8s-ci-robot