Multiple descheduler strategies of the same type

Open irmiller22 opened this issue 4 years ago • 23 comments

Is your feature request related to a problem? Please describe.

k8s version: 1.18
descheduler version: 1.18

Currently, we have CronJob resources that are getting stuck due to this k8s issue: https://github.com/kubernetes/kubernetes/issues/52172. We are using managed k8s (EKS), and 1.19 is not available yet. The issue is apparently fixed in 1.19, but we're unable to upgrade.

Long story short, a CronJob pod will spin up, and will immediately get stuck due to the bug described in the linked issue above, and the pod will ultimately end up in Waiting state with the reason being CreateContainerError.

State:          Waiting
  Reason:       CreateContainerError

The only way to address this currently is to manually delete the problematic pods and have k8s re-schedule the CronJob objects during the next scheduled run. Currently, these problematic pods are blocking the CronJob resource from scheduling new pods.
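
For reference, the interim cleanup is essentially the following (a rough sketch; the app=my-cron label is hypothetical and stands in for whatever labels our CronJob pods actually carry):

# delete stuck cron pods that are still in the Pending phase
kubectl get pods -l app=my-cron --field-selector=status.phase=Pending -o name \
  | xargs -r kubectl delete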

Describe the solution you'd like

We'd like to leverage the descheduler to handle this case by allowing multiple PodLifeTime strategies. As I understand it from the documentation, we're only allowed to implement one definition per strategy. We'd like to have a policy that applies only to our CronJob objects (which have a priority class called service-cron), and another policy that applies to all other pods.

For example, the below configuration would be desired:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
     enabled: true
     params:
       podLifeTime:
         maxPodLifeTimeSeconds: 86400
         podStatusPhases:
         - "Pending"
  "PodLifeTime":
     enabled: true
     params:
       podLifeTime:
         maxPodLifeTimeSeconds: 86400
         podStatusPhases:
         - "Running"
       thresholdPriorityClassName: "service-cron" # this is a custom priority class

If there are other ways to accomplish this, please let me know! I'm open to suggestions.

Describe alternatives you've considered

We've considered upgrading our self-managed clusters to 1.19, but our managed clusters are unable to upgrade since 1.19 isn't available yet. We'd prefer not to run into version drift across our clusters.

What version of descheduler are you using?

descheduler version: 1.18

Additional context

irmiller22 avatar Jan 22 '21 17:01 irmiller22

Hi @irmiller22, thanks for opening this issue. We have discussed similar requests before (I thought there was an open issue, but I couldn't find it), so there is definitely some validity to what you describe.

One option we discussed was something similar to scheduler profiles (https://kubernetes.io/docs/reference/scheduling/config/#multiple-profiles). This would take some minor refactoring but is certainly doable.
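
For concreteness, the scheduler expresses this with a list of profiles in its configuration, roughly like the following (loosely based on the linked page; the no-scoring-scheduler profile is just the docs' example):

apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
  - schedulerName: no-scoring-scheduler
    plugins:
      preScore:
        disabled:
          - name: '*'
      score:
        disabled:
          - name: '*'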

Since I couldn't find another issue for this, maybe we can officially open discussion on this here. @ingvagabund @seanmalloy wdyt?

damemi avatar Jan 22 '21 18:01 damemi

We'd like to have a policy that applies only to our CronJob objects (which have a priority class called service-cron)

There's currently no way to separate pods into two distinct groups and have multiple instances of the same strategy target different groups. Specifying thresholdPriorityClassName in the strategy will still evict any pod whose priority is that of the class or lower.

It's better to allow a strategy to target a specific group of pods based on a label selector. thresholdPriorityClassName is used for deciding whether a pod gets evicted or not.

Nevertheless, +1 for allowing multiple instances of the same strategy to run wherever it's applicable. We might start with PodLifeTime, though it will require re-designing the policy file. E.g.

apiVersion: "descheduler/v1alpha2" <-- notice the version bump
kind: "DeschedulerPolicy"
strategies:
  - name: PodLifeTime
  - enabled: true
  ...
  - name: PodLifeTime
  - enabled: true
  ...
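
Purely as an illustration of the idea (none of the field names beyond the existing v1alpha1 params are settled), the CronJob case above could then be expressed as two entries of the same strategy, e.g. distinguished by a label selector:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
strategies:
  - name: PodLifeTime
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
        podStatusPhases:
          - "Pending"
  - name: PodLifeTime
    enabled: true
    params:
      labelSelector:        # hypothetical way to scope this instance to cron pods
        matchLabels:
          app: service-cron
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
        podStatusPhases:
          - "Running"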

ingvagabund avatar Jan 25 '21 11:01 ingvagabund

There's currently no way to separate pods into two distinct groups and have multiple instances of the same strategy target different groups

This was another concern I had. I wonder if implementing an annotation like deschedulerName on the pods would work with "multiple" deschedulers or if that's overcomplicating the project

damemi avatar Jan 25 '21 13:01 damemi

annotation like deschedulerName on the pods

As long as it's the applications that get to choose which descheduler is more suitable, and not just a way to run the same descheduler with a different name and a different set of strategies. Imho, it's orthogonal to this issue.

ingvagabund avatar Jan 26 '21 10:01 ingvagabund

I would like to pick this up if it's still under consideration. Got some questions:

  • Should the label selector be the differentiator between different instances of the same strategy? Having a default version of the strategy and variants of it targeting specific labeled pods: pods matching a label will execute the labeled strategy while others will use the default. This will make the default a must.
  • Re-designing the policy file: does v1alpha2 co-exist with v1alpha1? Need some help here on how; the apimachinery supports only v1 and v1beta.
  • What are the target strategies for this?

bytetwin avatar Apr 20 '21 14:04 bytetwin

I've given my perspective on your questions inline below. Prior to starting implementation we need to have some sort of high-level agreement on what exactly the v1alpha2 DeschedulerPolicy will look like. For starters I recommend adding a well thought out comment on this issue with a detailed proposal for v1alpha2. Let's also see what @damemi and @ingvagabund think about this.

  • Should the label selector be the differentiator between different instances of the same strategy? Having a default version of the strategy and variants of it targeting specific labeled pods: pods matching a label will execute the labeled strategy while others will use the default. This will make the default a must.

The only other option I can think of is using an annotation, which was mentioned in https://github.com/kubernetes-sigs/descheduler/issues/486#issuecomment-766830051.

  • Re-designing the policy file: does v1alpha2 co-exist with v1alpha1? Need some help here on how; the apimachinery supports only v1 and v1beta.

I believe it makes sense to have v1alpha1 co-exist with v1alpha2.

  • What are the target strategies for this?

I think we would want to allow multiple descheduler profiles for all strategies.

seanmalloy avatar Apr 23 '21 05:04 seanmalloy

@hanumanthan thanks for picking this up.

  • Should the label selector be the differentiator between different instances of the same strategy? Having a default version of the strategy and variants of it targeting specific labeled pods: pods matching a label will execute the labeled strategy while others will use the default. This will make the default a must.

All strategies up to LowNodeUtilization can already specify a label selector: https://github.com/kubernetes-sigs/descheduler/pull/510. As long as only an admin needs to configure the descheduler, this is sufficient.
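
If I'm reading #510 right, that means a single PodLifeTime instance can already be scoped to labeled pods with something like the sketch below (the app: service-cron label is hypothetical):

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
        podStatusPhases:
        - "Running"
      labelSelector:
        matchLabels:
          app: service-cron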

  • Re-designing the policy file: does v1alpha2 co-exist with v1alpha1? Need some help here on how; the apimachinery supports only v1 and v1beta.

I believe it makes sense to have v1alpha1 co-exist with v1alpha2.

Providing multiple instances of the same strategy requires incompatible changes. So we will need to support both versions at the same time for at least 3 releases. Also, I'd like to keep the changes simple so we can simply convert v1alpha1 to v1alpha2.

With that, we might also get rid of enable: true and replace it with disabled: false (which will not be specified by default to reduce the number of params). Not sure what was the historical reason to create the inverted param. @ravisantoshgudimetla do you recall?

  • What are the target strategies for this?

In theory all of them, as long as a strategy allows setting a namespace alongside other parameters (e.g. thresholdPriority). Even LowNodeUtilization can be utilized here if we allow setting a node selector to create multiple (and non-overlapping) node pools. On the other hand, it will be up to the user to make sure multiple instances of the same strategy do not interfere with each other where they are not supposed to.

Back to

differentiator between different instances of the same strategy

We can't reasonably check within the descheduler that every configuration is conflict-free. Two instances of the same strategy can have non-overlapping groups of pods based on a label selector, though there's nothing forbidding a group that is targeted by both label selectors, and this can also change dynamically.

ingvagabund avatar Apr 23 '21 08:04 ingvagabund

As @seanmalloy said, this is a much higher-level discussion which will require a design doc for the new version. A new API version is an opportunity for us to resolve other possible improvements we've uncovered as the project has grown (one example is the change I suggested in https://github.com/kubernetes-sigs/descheduler/pull/314).

  • Re-designing the policy file: does v1alpha2 co-exist with v1alpha1? Need some help here on how; the apimachinery supports only v1 and v1beta.

I believe it makes sense to have v1alpha1 co-exist with v1alpha2.

Providing multiple instances of the same strategy requires incompatible changes. So we will need to support both versions at the same time for at least 3 releases. Also, I'd like to keep the changes simple so we can simply convert v1alpha1 to v1alpha2.

Technically, an alpha API can be removed at any time, especially since this is just a config API and Descheduler is the only consumer of it.

However as good citizens it would be preferable to support both APIs to allow users time to switch over. This will require us to generate conversion functions between the two (probably with generate-internal-groups.sh). @hanumanthan as a side note here, this versioning is our own and is separate from the versions in apimachinery, which is a different API.

I will start a design document to gather some of the notes that have been mentioned here so far

damemi avatar Apr 23 '21 14:04 damemi

With that, we might also get rid of enable: true and replace it with disabled: false (which will not be specified by default to reduce the number of params). Not sure what was the historical reason to create the inverted param. @ravisantoshgudimetla do you recall?

I don't recall the exact reason, but I believe we did it for the sake of being in sync with feature gate flags, which used to have Enabled: true, IIRC.

ravisantoshgudimetla avatar Apr 23 '21 20:04 ravisantoshgudimetla

Thanks @damemi, I will wait for the design doc to start working on this feature. Happy to be a collaborator on the design, though I may have very little input for it.

bytetwin avatar Apr 25 '21 23:04 bytetwin

Here is a link to the doc I started: https://docs.google.com/document/d/1S1JCh-0F-QCJvBBG-kbmXiHAJFF8doArhDIAKbOj93I/edit#

Feel free to comment there or propose any alternatives. I think the options are fairly straightforward

damemi avatar Apr 27 '21 15:04 damemi

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

k8s-triage-robot avatar Jul 27 '21 07:07 k8s-triage-robot

/remove-lifecycle stale

ingvagabund avatar Jul 27 '21 08:07 ingvagabund

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 25 '21 08:10 k8s-triage-robot

/remove-lifecycle stale

damemi avatar Oct 25 '21 16:10 damemi

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 23 '22 17:01 k8s-triage-robot

Just adding a voice of support for this issue. The ability to define different TTLs for pods based on different labels is pretty important to us.

diranged avatar Feb 03 '22 19:02 diranged

/remove-lifecycle stale

seanmalloy avatar Feb 04 '22 22:02 seanmalloy

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 05 '22 23:05 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jun 04 '22 23:06 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Jul 04 '22 23:07 k8s-triage-robot

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jul 04 '22 23:07 k8s-ci-robot

I'm also keen to see this moving. Just like @diranged, at the company where I work we really want to have specific strategies per namespace/label. Without this, it's hard to adopt the descheduler: although it does do what we want, it doesn't do it the way we want.

I initially thought the descheduler would be a fully fledged operator where we could provide CRDs in a distributed way and it would merge/compile the strategies/policies into one config and run it. If that was the case, it would be awesome.

(A question for the core team): Would you be willing to move towards the "operator with CRDs" target?

migueleliasweb avatar Oct 03 '22 22:10 migueleliasweb

  • Is running multiple instances of kube-descheduler, each with a dedicated policy.yaml, a viable workaround in the short term? Any perceived gotchas with this approach? More requests on the API server than necessary is the obvious one, but I think I'm more worried about correctness (a rough sketch of what I mean is below).
  • What is the rationale for adding an array of profiles to a new "resource type" vs. simply allowing multiple policies to co-exist? Unlike scheduling, where you can configure custom schedulers, such a capability does not exist for descheduling, so I wonder whether patterning the API design in a manner inspired by scheduler profiles is necessary.
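
To be concrete about the first bullet, the short-term setup I have in mind is simply two descheduler CronJobs/Deployments, each mounting its own policy ConfigMap and pointing --policy-config-file at it, roughly (a sketch; the names are just examples):

# Policy for a descheduler instance dedicated to cron pods
apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy-cron
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      "PodLifeTime":
        enabled: true
        params:
          podLifeTime:
            maxPodLifeTimeSeconds: 86400
            podStatusPhases:
            - "Running"
---
# A second ConfigMap (e.g. descheduler-policy-default) carries the policy for
# everything else, and a second descheduler workload mounts it and passes
# --policy-config-file=<mount path>/policy.yaml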

jacobstr avatar Nov 16 '22 08:11 jacobstr

Has this been succeeded by https://github.com/kubernetes-sigs/descheduler/issues/926? It appears the original design doc went stale in April 2021 and was replaced by a new document?

If so, might I suggest closing this in favor of #926? I found it a bit confusing to find the new design.

jacobstr avatar Nov 16 '22 09:11 jacobstr

Many of the issues were kept open so we can keep track of the original requests.

@damemi would you be ok with referencing the new design in https://docs.google.com/document/d/1S1JCh-0F-QCJvBBG-kbmXiHAJFF8doArhDIAKbOj93I/edit#?

ingvagabund avatar Nov 16 '22 10:11 ingvagabund

@damemi would you be ok with referencing the new design in https://docs.google.com/document/d/1S1JCh-0F-QCJvBBG-kbmXiHAJFF8doArhDIAKbOj93I/edit#?

Yup, done

Also, agree with @ingvagabund on keeping this open. This issue is a subset of the overall design of v1alpha2 api in #926

damemi avatar Nov 16 '22 14:11 damemi

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 14 '23 14:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Mar 16 '23 15:03 k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 19 '23 19:06 k8s-triage-robot