
"Do not evict" only for running pods

runningman84 opened this issue 1 year ago · 19 comments

Tell us about your request

We configure all of our Helm hooks and other important pods with the do-not-evict annotation. Some pods never reach the Running state because of errors like CreateContainerError, CreateContainerConfigError, or others.

Therefore we would like Karpenter to behave better: only consider the do-not-evict annotation if a given pod is in a creating or running state.
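
A minimal sketch of the requested behavior, assuming the `karpenter.sh/do-not-evict` annotation from the v1alpha5 API and the two error reasons named above (this is not Karpenter's actual code):

```go
package evictcheck

import corev1 "k8s.io/api/core/v1"

// doNotEvictAnnotation is the annotation name from the v1alpha5 API era.
const doNotEvictAnnotation = "karpenter.sh/do-not-evict"

// BlocksEviction returns true only when the pod carries the annotation AND is
// either running or still starting up normally. Pods stuck in error states
// such as CreateContainerError would no longer pin the node.
func BlocksEviction(pod *corev1.Pod) bool {
	if pod.Annotations[doNotEvictAnnotation] != "true" {
		return false
	}
	switch pod.Status.Phase {
	case corev1.PodRunning:
		return true
	case corev1.PodPending:
		for _, cs := range pod.Status.ContainerStatuses {
			if w := cs.State.Waiting; w != nil &&
				(w.Reason == "CreateContainerError" || w.Reason == "CreateContainerConfigError") {
				return false
			}
		}
		return true
	default:
		// Succeeded, Failed, Unknown: do not block.
		return false
	}
}
```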

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Right now a single pod can keep a node running indefinitely. Over time this becomes very costly... Imagine having dozens of nodes left behind in this state.

Are you currently working around this issue?

We manually kill these pods, which does not scale if you have dozens of teams and projects running on a given cluster...

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

runningman84 avatar Mar 06 '23 12:03 runningman84

I think it generally makes sense that if the pod is in an errored state and is clearly not running, disruption like this should be allowed.

One thing I can think of is that this has the potential to race and leave clusters in a half-finished state if they run stateful jobs:

  1. Pod errors and enters a Failed state
  2. do-not-evict is ignored and Karpenter begins to execute deprovisioning
  3. Pod starts running
  4. Node is deprovisioned which leaves the job half-done

Any thoughts to add to this @njtran?

jonathan-innis avatar Mar 06 '23 18:03 jonathan-innis

This race condition would only occur if a given pod were the only reason the cluster scaled up in the first place.

My observation was that most of the time a pod kept a node running that had already been marked NoSchedule because it was no longer needed, and the pod would also have fit on some other node.

runningman84 avatar Mar 07 '23 16:03 runningman84

This race condition would only occur if a given pod were the only reason the cluster scaled up in the first place

Not necessarily. Consolidation or expiration could occur on a node with do-not-evict pods while other workloads are still scheduled on that node.

I think the general guidance is that jobs should be tolerant of disruption and non-stateful, since there are any number of reasons why they could fail or get rescheduled.

In short, I agree that the race is most likely inconsequential.

jonathan-innis avatar Mar 07 '23 19:03 jonathan-innis

There won’t be other new workloads, because once Karpenter decides to remove the node it is marked NoSchedule anyway.

runningman84 avatar Mar 07 '23 19:03 runningman84

So I believe that we already check whether a pod is terminal (succeeded or failed) when honoring the do-not-evict annotation.

https://github.com/aws/karpenter-core/blob/main/pkg/controllers/deprovisioning/helpers.go#L328-L334
https://github.com/aws/karpenter-core/blob/main/pkg/utils/pod/scheduling.go#L50-L52

Is there something I'm missing here? Are you not seeing this happening in your cluster? In the case of only considering running or creating pods, I'm not sure we should ignore the do-not-evict annotation in the case of Unknown or Terminating, as those map to many different scenarios as well. If I understand correctly, a pod being terminal (succeeded or failed) should be sufficient.
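
For readers following along, the linked check amounts to roughly the following (a paraphrase, not the exact source):

```go
package pod

import corev1 "k8s.io/api/core/v1"

// IsTerminal reports whether the pod has finished, successfully or not.
// Terminal pods are not considered when honoring the do-not-evict annotation.
func IsTerminal(pod *corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed
}
```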

njtran avatar Mar 07 '23 20:03 njtran

After poking around, I think I understand your request now. Looks like you want to bypass the annotation in the case that any of the containers have a problem that you deem a "terminal" state, resulting in the pod staying in the pending phase. If we wanted to try this, I think we'd have to check all the containers on the pod and allow a set of container statuses to bypass the annotation.

Checking the Go struct used for this, it doesn't seem like there's a fixed set of enum values: https://pkg.go.dev/k8s.io/api/core/v1#ContainerStateWaiting. This means "Waiting" could happen for any number of reasons that may or may not remediate themselves; it's tough to know.

I'd be happy to support this though, if we can construct a specific set of container status reasons that we can deem to satisfy @jonathan-innis's condition.

pod is in an errored state and is clearly not running
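
As one illustration of what such an allowlist could look like, here is a sketch with an assumed set of waiting reasons (the concrete set would need to be agreed on, since `ContainerStateWaiting.Reason` is free-form text):

```go
package pod

import corev1 "k8s.io/api/core/v1"

// stuckWaitingReasons is an assumed allowlist of reasons treated as
// "errored and clearly not running"; it is not a vetted or exhaustive set.
var stuckWaitingReasons = map[string]bool{
	"CreateContainerError":       true,
	"CreateContainerConfigError": true,
	"InvalidImageName":           true,
}

// HasStuckContainer reports whether any container on the pod is waiting with
// one of the allowlisted reasons.
func HasStuckContainer(pod *corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if w := cs.State.Waiting; w != nil && stuckWaitingReasons[w.Reason] {
			return true
		}
	}
	return false
}
```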

njtran avatar Mar 07 '23 21:03 njtran

Yes, I would like to ignore pods whose containers will never recover from a terminal state, but I would also like to simply ignore pods which are in these states:

  • Failed
  • Succeeded/Completed (this could already be implemented?)

Maybe for pods in a pending state we could ignore them after some "timeout"?
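
Purely as an illustration of that optional timeout idea, a sketch follows; the timeout parameter is hypothetical and not an existing Karpenter setting:

```go
package pod

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// PendingTooLong reports whether a pod has sat in the Pending phase longer
// than an operator-chosen timeout. The timeout would have to be a
// configurable, opt-in value.
func PendingTooLong(pod *corev1.Pod, timeout time.Duration) bool {
	return pod.Status.Phase == corev1.PodPending &&
		time.Since(pod.CreationTimestamp.Time) > timeout
}
```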

runningman84 avatar Mar 08 '23 14:03 runningman84

Yes, I would like to ignore pods whose containers will never recover from a terminal state, but I would also like to simply ignore pods which are in these states:

  • Failed
  • Succeeded/Completed (this could already be implemented?)

From the link that @njtran shared, we already account for both of these states when evaluating whether to honor the do-not-evict annotation. My best guess for what you are hitting here is pods that are assigned to a node but are stuck in a pending state due to some issue that is preventing the container from actually being created and moving into a running state.

CreateContainerError, CreateContainerConfigError

Can you check the pod phase when you are seeing these errors on your pods? My assumption is that these pods are moving into a pending/running state and therefore they are considered for the do-not-evict annotation.

Maybe for pods in a pending state we could ignore them after some "timeout"?

This seems like a dangerous position for Karpenter to take. We'd essentially be giving a TTL on pod startup time, which isn't a one-size-fits-all value.

jonathan-innis avatar Mar 13 '23 17:03 jonathan-innis

Yes, I would like to ignore pods whose containers will never recover from a terminal state, but I would also like to simply ignore pods which are in these states:

  • Failed
  • Succeeded/Completed (this could already be implemented?)

From the link that @njtran shared, we already account for both of these states when evaluating whether to honor the do-not-evict annotation. My best guess for what you are hitting here is pods that are assigned to a node but are stuck in a pending state due to some issue that is preventing the container from actually being created and moving into a running state.

That's also my assumption

CreateContainerError, CreateContainerConfigError

Can you check the pod phase when you are seeing these errors on your pods? My assumption is that these pods are moving into a pending/running state and therefore they are considered for the do-not-evict annotation.

The pod is in the Pending state and the container is in a Waiting state; here is a similar example: https://www.orchome.com/1938

Maybe for pods in a pending state we could ignore them after some "timeout"?

This seems like a dangerous position for Karpenter to take. We'd essentially be giving a TTL on pod startup time, which isn't a one-size-fits-all value.

Okay, maybe this could be an optional feature

runningman84 avatar Apr 19 '23 08:04 runningman84

The pod is in the Pending state and the container is in a Waiting state; here is a similar example

Is it possible to get this at the source and remediate the containers that are hanging in this state so that you don't get hung nodes? Is it possible to create some sort of alerting around how long a Pod sits in a Pending phase so that if one sits there for too long, you get an alarm and then can go to the dev team to ask them to remediate their image/container?
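
A sketch of what such out-of-band detection could look like with client-go; the kubeconfig path and the 30-minute cutoff are assumptions:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a local kubeconfig; in-cluster config would work the same way.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	const maxPendingAge = 30 * time.Minute // assumed cutoff

	// List Pending pods across all namespaces and report the ones that have
	// been stuck longer than the cutoff.
	pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "status.phase=Pending",
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		if age := time.Since(p.CreationTimestamp.Time); age > maxPendingAge {
			fmt.Printf("%s/%s has been Pending for %s\n", p.Namespace, p.Name, age.Round(time.Minute))
		}
	}
}
```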

jonathan-innis avatar Apr 19 '23 23:04 jonathan-innis

We have alerting in place, but if you have a lot of teams using the cluster it is a pain to manually take care of these issues.

runningman84 avatar Apr 20 '23 09:04 runningman84

Hello folks, I sent out this PR to fix this; please check it out and help review: https://github.com/aws/karpenter-core/pull/778

mallow111 avatar Nov 11 '23 00:11 mallow111

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 09 '24 01:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Mar 10 '24 01:03 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Apr 09 '24 02:04 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 09 '24 02:04 k8s-ci-robot

/reopen I do not think that this is solved

runningman84 avatar Apr 09 '24 09:04 runningman84

/reopen

runningman84 avatar Apr 11 '24 13:04 runningman84

@runningman84: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 11 '24 13:04 k8s-ci-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar May 11 '24 13:05 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar May 11 '24 13:05 k8s-ci-robot