HighNodeUtilization does nothing when all nodes are underutilized
What version of descheduler are you using?
descheduler version: 0.23
Does this issue reproduce with the latest release? Yes
Which descheduler CLI options are you using? Using helm chart 0.23.1 with these overrides:
deschedulerPolicy:
  strategies:
    HighNodeUtilization:
      enabled: true
      params:
        nodeResourceUtilizationThresholds:
          thresholds:
            memory: 20
          numberOfNodes: 0
    LowNodeUtilization:
      enabled: false
schedule: 5 10 * * *
What k8s version are you using (kubectl version)?
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:59:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9", GitTreeState:"clean", BuildDate:"2021-10-29T23:32:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
What did you do?
Installed descheduler. Ran the cron job manually.
What did you expect to see?
The descheduler evicting pods from some of the underutilized instances.
What did you see instead?
Descheduler didn't evict any pods. Saw this in the log:
I0214 21:57:50.046412 1 highnodeutilization.go:90] "Criteria for a node below target utilization" CPU=100 Mem=20 Pods=100
I0214 21:57:50.046417 1 highnodeutilization.go:91] "Number of underutilized nodes" totalNumber=4
I0214 21:57:50.046421 1 highnodeutilization.go:102] "All nodes are underutilized, nothing to do here"
These lines of code https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/descheduler/strategies/nodeutilization/highnodeutilization.go#L101-L103 seem to be very similar to code in LowNodeUtilization https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/descheduler/strategies/nodeutilization/lownodeutilization.go#L128-L130, perhaps a stray copy-paste?
Hi, actually LowNodeUtilization and HighNodeUtilization do look very similar. The biggest difference in logic is in isNodeWithLowUtilization(usage).
Regarding your question ("Descheduler didn't evict any pods"): I will check now.
Hi, after checking the code: since all nodes are underutilized, no pods will be evicted, so you need to adjust the thresholds. As it stands, if all nodes are underutilized, we do nothing.
Do you expect that, if all nodes are underutilized, pods should be evicted from only one randomly chosen node?
if len(sourceNodes) == len(nodes) {
klog.V(1).InfoS("All nodes are underutilized, nothing to do here")
return
}
This seems like a bug to me; this line was probably carried over from LowNodeUtilization, where a check like this makes more sense. But the point of HighNodeUtilization is to try to achieve bin-packing. So when all nodes are underutilized, we should definitely be taking some action. The question is what that should be.
Do you expect that, if all nodes are underutilized, pods should be evicted from only one randomly chosen node?
I think we could take a more opinionated approach than this. Since HighNodeUtilization is specifically trying to evict the least-utilized nodes, that gives us a goal to work toward. Maybe we could sort the nodes by utilization and evict pods from the least-utilized nodes until the higher nodes become full (or at least we assume they'll be full). Similar to the pod topology spread balancing strategy.
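To make that concrete, here is a minimal self-contained sketch (not the descheduler's actual types or API; the nodeUsage struct, the averaging, and the numbers are all made up for illustration) of sorting nodes by average fractional utilization so the least-utilized ones become eviction sources:

package main

import (
	"fmt"
	"sort"
)

// nodeUsage is a hypothetical, simplified stand-in for the strategy's internal
// bookkeeping: fraction of allocatable resources in use, per resource.
type nodeUsage struct {
	name  string
	usage map[string]float64 // e.g. "memory": 0.10 means 10% of allocatable
}

// avgUtilization collapses the per-resource fractions into a single sort key.
func avgUtilization(n nodeUsage) float64 {
	if len(n.usage) == 0 {
		return 0
	}
	sum := 0.0
	for _, f := range n.usage {
		sum += f
	}
	return sum / float64(len(n.usage))
}

func main() {
	nodes := []nodeUsage{
		{"node-a", map[string]float64{"cpu": 0.40, "memory": 0.35, "pods": 0.30}},
		{"node-b", map[string]float64{"cpu": 0.10, "memory": 0.15, "pods": 0.05}},
		{"node-c", map[string]float64{"cpu": 0.25, "memory": 0.20, "pods": 0.30}},
	}

	// Least-utilized first: these become the eviction sources, while the
	// most-utilized nodes are left alone as presumed re-scheduling targets.
	sort.Slice(nodes, func(i, j int) bool {
		return avgUtilization(nodes[i]) < avgUtilization(nodes[j])
	})

	for _, n := range nodes {
		fmt.Printf("%s: %.0f%% average utilization\n", n.name, avgUtilization(n)*100)
	}
}

Averaging cpu, memory, and pod-count fractions is only one possible sort key; any weighted combination could be swapped into avgUtilization the same way.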
I agree, I'd expect the least-utilized node to get evicted.
This is somewhat tangential, but perhaps we could have an option for the descheduler to aggressively force "optimal" packing by cordoning off the nodes before evicting pods from them? That way there would be no chance of pods ending up back on low-utilization nodes, regardless of the scheduler configuration - i.e. you wouldn't need to set up NodeResourcesFit = MostAllocated to get at least some effect from descheduling.
If all nodes are underutilized, sort the nodes by utilization and evict pods from the least-utilized nodes. Very good idea! Could I have a try at this enhancement?
@slobo automatically cordoning the nodes is an interesting idea. I believe cordon drains the nodes with respect to PDBs, which aligns with our use of the eviction API. Though there might be a race condition there between the pods the descheduler chooses to evict and the cordoned node draining itself.
I also would not recommend it as a method to bypass a proper scheduler configuration. The descheduler is intended to work together with the scheduler, so we design it with the assumption that all evictions will be mapped to a corresponding re-schedule (if the pod gets recreated). Circumventing that could open up to unpredictable behavior we can't support.
@JaneLiuL sure, feel free to start some work on it. I think we might need to think about how we'll sort the nodes, since there are multiple criteria for underutilization (memory, cpu, pod). Maybe the average of them and sort that way?
I also think we could do more than just sort and evict pods from the least-utilized node. If multiple nodes are underutilized, it could continue evicting while re-calculating (in memory) the usage of the higher nodes. In other words, simulate the evictions and re-scheduling until there are no more underutilized nodes or no more pods eligible for eviction. The goal being to eliminate all underutilized nodes rather than just the lowest.
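As a rough illustration of that in-memory simulation (hypothetical numbers; it treats a whole node's load as a single movable unit, which real pods are not, and assumes the scheduler re-packs evicted pods onto the chosen target, e.g. via NodeResourcesFit with MostAllocated):

package main

import (
	"fmt"
	"sort"
)

func main() {
	// Hypothetical per-node utilization, in percent of allocatable.
	util := []int{10, 20, 30, 40}

	for {
		sort.Ints(util) // ascending: least-utilized nodes first

		// Pick the least-utilized node that still has pods to evict.
		src := 0
		for src < len(util) && util[src] == 0 {
			src++
		}
		if src >= len(util)-1 {
			break // at most one non-empty node left; nothing to bin-pack
		}

		// Pick the most-utilized node that can still absorb the source's load.
		dst := len(util) - 1
		for dst > src && util[dst]+util[src] > 100 {
			dst--
		}
		if dst == src {
			break // no target has enough headroom for the whole source node
		}

		// Simulate the evictions and assume the scheduler re-packs the
		// evicted pods onto the target node.
		util[dst] += util[src]
		util[src] = 0
		fmt.Println(util)
	}
}

Starting from 10/20/30/40%, this prints [0 20 30 50], [0 0 30 70], [0 0 0 100]: it keeps evicting until no underutilized source node can fit onto a remaining target.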
I also would not recommend it as a method to bypass a proper scheduler configuration.
Heh, yeah, sounds like playing with fire for sure.
Would be great to have #668 resolved and docs updated on how to go about the correct scheduler configuration.
@slobo yeah good point, bumped that issue because someone had expressed interest in updating the docs (though sometimes these things get buried)
Questions to think about:
- If all nodes are underutilized, how can one decide which nodes should be completely drained and which should stay as potential target nodes for attracting evicted pods?
- "Maybe we could sort the nodes by utilization and evict pods from the least-utilized nodes until the higher nodes become full (or at least we assume they'll be full)" - I wonder how many least-utilized nodes we should evict from, and how many higher nodes we should target (taking into account not just the case where all the nodes are underutilized)?
The threshold was introduced as a trivial solution for dividing nodes into two groups: one group of nodes to evict pods from, the other as potential targets for co-locating pods to achieve the bin-packing (it is up to the kube-scheduler to decide which nodes are actually targeted). In this case the threshold is too high to evict pods from any node. So you need an adaptive approach which lowers the threshold. Something like "if all nodes are underutilized, find the first X% of nodes which have the lowest utilization (wrt. native and extended resources) and dynamically adjust the threshold". The X can be set to 50% or higher based on customer use cases. E.g.:
deschedulerPolicy:
  strategies:
    HighNodeUtilization:
      enabled: true
      params:
        nodeResourceUtilizationThresholds:
          thresholds:
            memory: 20
          adaptiveThresholdPercentages: 50 # will apply only when all nodes are underutilized
          numberOfNodes: 0
    LowNodeUtilization:
      enabled: false
schedule: 5 10 * * *
The strategy would then "just" adapt the threshold and run the strategy one more time with the new threshold. With expectation that next time there's gonna be at least one node that is minimally utilized (on the scale [underutilized, utilized, overutilized]).
@slobo I wonder if dynamically adapting the threshold would be something applicable in your use case(s)?
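A sketch of what that adaptation could look like (adaptiveThresholdPercentages above is only a proposal, and this percentile math is just one possible reading of it, not descheduler code): when every node is below the configured threshold, lower the threshold to the utilization of the node at the configured percentile, so the lowest X% of nodes become eviction candidates.

package main

import (
	"fmt"
	"sort"
)

// adaptThreshold returns a new threshold when every node is below the
// configured one: the utilization of the node at the given percentile,
// so roughly the lowest `percent` of nodes stay below the new threshold.
func adaptThreshold(memUtil []float64, configured, percent float64) float64 {
	for _, u := range memUtil {
		if u >= configured {
			return configured // at least one node is not underutilized; keep as is
		}
	}
	sorted := append([]float64(nil), memUtil...)
	sort.Float64s(sorted)
	idx := int(float64(len(sorted)) * percent / 100) // node at the X-th percentile
	if idx >= len(sorted) {
		idx = len(sorted) - 1
	}
	return sorted[idx]
}

func main() {
	memUtil := []float64{8, 12, 15, 17} // percent memory utilization per node
	// Configured thresholds.memory is 20, adaptiveThresholdPercentages is 50:
	// every node is below 20%, so lower the threshold to the median node's usage.
	fmt.Println(adaptThreshold(memUtil, 20, 50)) // prints 15; nodes at 8% and 12% become sources
}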
I wonder how many least-utilized nodes we should evict from, and how many higher nodes we should target (taking into account not just the case where all the nodes are underutilized)?
If the goal of HighNodeUtilization is to bin-pack, then imo the answer is "all the nodes". For example if you have 4 nodes with these utilizations:
10% 20% 30% 40%
Then evicting from the lowest onto the highest would follow these steps:
1. 0% 20% 30% 50%
2. 0% 0% 30% 70%
3. 0% 0% 0% 100%
Since the scheduler doesn't have any configurable threshold for "fully-packed" (i.e., it always tries for 100% utilization when configured this way), staying in sync with the assumed re-scheduling strategy means working from the assumption that we want as few nodes as possible at 100% and as many as possible at 0%.
I wonder what the ratio of fully evictable nodes to "fully-packed" nodes is in the wild. Given the descheduler cannot simulate the optimal number of nodes to leave intact, we still need some artificial threshold/way of saying "this is where we stop evicting so we don't over-flood the remaining nodes". The current algorithm is too simple and presumably too sub-optimal to take that assumption into account.
I wonder if it is worth extending the current implementation towards the assumption or re-implementing it (maybe as a new strategy replacing this one) to evict pods from the lowest to the highest utilized nodes. I have not explored this path yet. Open to discussion.
I also would not recommend it as a method to bypass a proper scheduler configuration
My use case is that nodes were properly provisioned at the time of scale-up, but maybe some pods were from jobs or something, leaving the node over-provisioned (underutilized). Thus I want the descheduler to fix an issue that would otherwise go untouched.
It seems we need to take a big-picture look at how to improve the descheduler for this part.
I also would not recommend it as a method to bypass a proper scheduler configuration
My use case is that nodes were properly provisioned at the time of scale-up, but maybe some pods were from jobs or something, leaving the node over-provisioned (underutilized). Thus I want the descheduler to fix an issue that would otherwise go untouched.
That's a good point, and an interesting use case. You're talking about draining certain nodes to deprovision them but still keeping an even spread, right?
Have you thought about something like tainting the node, then running the descheduler with RemovePodsViolatingNodeTaints to drain it?
Have you thought about something like tainting the node, then running the descheduler with RemovePodsViolatingNodeTaints to drain it?
Absolutely thought about that. That's why I'm here. 😂 Give me automation, or give me death.
I was hoping that the descheduler could do exactly this operation: taint/cordon nodes based on utilization and then drain them.
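For reference, the manual workflow suggested above would look roughly like this with the helm-style values used earlier in this issue (the node name and taint key are placeholders; pods on the tainted node that lack a matching toleration become eviction candidates):

# kubectl taint nodes <node-name> example.com/drain=true:NoSchedule
deschedulerPolicy:
  strategies:
    RemovePodsViolatingNodeTaints:
      enabled: true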
+1 for the issue. Thanks
@JaneLiuL sure, feel free to start some work on it. I think we might need to think about how we'll sort the nodes, since there are multiple criteria for underutilization (memory, cpu, pod). Maybe the average of them and sort that way?
@JaneLiuL are you working on this issue? If not I can take this up.
@dineshbhor sorry, I forgot about this issue; you can take it :)
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
So has this been abandoned? I wanted to use the descheduler to achieve bin-packing, but it fails with the exact same issue the OP presented.
@ppawlowski not abandoned, just not re-opened.
Any updates on this? Just ran into this issue myself when I noticed one of our clusters was not packed as well as our others.
https://github.com/kubernetes-sigs/descheduler/pull/893 was closed. I am not aware of any other effort in this direction. Unless proven otherwise, this is up for the taking.
Is there any update on this question? We recently encountered the same problem when packing GPU cards. Each node has 8 GPU cards, and most nodes have 4 GPU cards occupied. We hope to improve the utilization of GPU cards through the HighNodeUtilization strategy.
Encountering the same issue. We would like jobs to bin-pack onto nodes using the HighNodeUtilization strategy.