HighNodeUtilization does nothing when all nodes are underutilized
What version of descheduler are you using?
descheduler version: 0.23
Does this issue reproduce with the latest release? Yes
Which descheduler CLI options are you using? Using helm chart 0.23.1 with these overrides:
deschedulerPolicy:
  strategies:
    HighNodeUtilization:
      enabled: true
      params:
        nodeResourceUtilizationThresholds:
          thresholds:
            memory: 20
          numberOfNodes: 0
    LowNodeUtilization:
      enabled: false
schedule: 5 10 * * *
What k8s version are you using (kubectl version)?
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:59:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9", GitTreeState:"clean", BuildDate:"2021-10-29T23:32:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
What did you do?
Installed descheduler. Ran the cron job manually.
What did you expect to see?
The descheduler evicting pods from some of the underutilized instances.
What did you see instead?
Descheduler didn't evict any pods. Saw this in the log:
I0214 21:57:50.046412 1 highnodeutilization.go:90] "Criteria for a node below target utilization" CPU=100 Mem=20 Pods=100
I0214 21:57:50.046417 1 highnodeutilization.go:91] "Number of underutilized nodes" totalNumber=4
I0214 21:57:50.046421 1 highnodeutilization.go:102] "All nodes are underutilized, nothing to do here"
These lines of code https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/descheduler/strategies/nodeutilization/highnodeutilization.go#L101-L103 seem to be very similar to code in LowNodeUtilization https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/descheduler/strategies/nodeutilization/lownodeutilization.go#L128-L130, perhaps a stray copy-paste?
Hi, actually LowNodeUtilization and HighNodeUtilization do look very similar. The biggest difference in logic is in isNodeWithLowUtilization(usage).
Regarding your question ("Descheduler didn't evict any pods"): I will check now.
Hi, after checking the code: since all nodes are underutilized, no pods will be evicted, so you need to adjust the thresholds. As it stands, if all nodes are underutilized, we do nothing.
Do you expect that, if all nodes are underutilized, pods should be evicted from only one randomly chosen node?
if len(sourceNodes) == len(nodes) {
klog.V(1).InfoS("All nodes are underutilized, nothing to do here")
return
}
This seems like a bug to me; this line was probably carried over from LowNodeUtilization, where a check like this makes more sense. But the point of HighNodeUtilization is to try to achieve bin-packing. So when all nodes are underutilized, we should definitely be taking some action. The question is what that should be.
Do you expect that, if all nodes are underutilized, pods should be evicted from only one randomly chosen node?
I think we could take a more opinionated approach than this. Since HighNodeUtilization is specifically trying to evict the least-utilized nodes, that gives us a goal to work toward. Maybe we could sort the nodes by utilization and evict pods from the least-utilized nodes until the higher nodes become full (or at least we assume they'll be full). Similar to the pod topology spread balancing strategy.
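To make that concrete, here is a minimal self-contained sketch (not the descheduler's actual types or API; the nodeUsage struct, the averaging, and the numbers are all made up for illustration) of sorting nodes by average fractional utilization so the least-utilized ones become eviction sources:

package main

import (
	"fmt"
	"sort"
)

// nodeUsage is a hypothetical, simplified stand-in for the strategy's internal
// bookkeeping: fraction of allocatable resources in use, per resource.
type nodeUsage struct {
	name  string
	usage map[string]float64 // e.g. "memory": 0.10 means 10% of allocatable
}

// avgUtilization collapses the per-resource fractions into a single sort key.
func avgUtilization(n nodeUsage) float64 {
	if len(n.usage) == 0 {
		return 0
	}
	sum := 0.0
	for _, f := range n.usage {
		sum += f
	}
	return sum / float64(len(n.usage))
}

func main() {
	nodes := []nodeUsage{
		{"node-a", map[string]float64{"cpu": 0.40, "memory": 0.35, "pods": 0.30}},
		{"node-b", map[string]float64{"cpu": 0.10, "memory": 0.15, "pods": 0.05}},
		{"node-c", map[string]float64{"cpu": 0.25, "memory": 0.20, "pods": 0.30}},
	}

	// Least-utilized first: these become the eviction sources, while the
	// most-utilized nodes are left alone as presumed re-scheduling targets.
	sort.Slice(nodes, func(i, j int) bool {
		return avgUtilization(nodes[i]) < avgUtilization(nodes[j])
	})

	for _, n := range nodes {
		fmt.Printf("%s: %.0f%% average utilization\n", n.name, avgUtilization(n)*100)
	}
}

Averaging cpu, memory, and pod-count fractions is only one possible sort key; any weighted combination could be swapped into avgUtilization the same way.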
I agree, I'd expect the least-utilized node to get evicted.
This is somewhat tangential, but perhaps we could have an option for the descheduler to aggressively force "optimal" packing by cordoning off the nodes before evicting pods from them? That way there would be no chance of pods ending up back on low-utilization nodes, regardless of the scheduler configuration - i.e. you wouldn't need to set up NodeResourcesFit = MostAllocated to get at least some effect from descheduling.
If all nodes are underutilized, sort the nodes by utilization and evict pods from the least-utilized nodes. Very good idea! Could I have a try at this enhancement?
@slobo automatically cordoning the nodes is an interesting idea. I believe cordon drains the nodes with respect to PDBs, which aligns with our use of the eviction API. Though there might be a race condition there between the pods the descheduler chooses to evict and the cordoned node draining itself.
I also would not recommend it as a method to bypass a proper scheduler configuration. The descheduler is intended to work together with the scheduler, so we design it with the assumption that all evictions will be mapped to a corresponding re-schedule (if the pod gets recreated). Circumventing that could open up to unpredictable behavior we can't support.
@JaneLiuL sure, feel free to start some work on it. I think we might need to think about how we'll sort the nodes, since there are multiple criteria for underutilization (memory, cpu, pod). Maybe the average of them and sort that way?
I also think we could do more than just sort and evict pods from the least-utilized node. If multiple nodes are underutilized, it could continue evicting while re-calculating (in memory) the usage of the higher nodes. In other words, simulate the evictions and re-scheduling until there are no more underutilized nodes or no more pods eligible for eviction. The goal being to eliminate all underutilized nodes rather than just the lowest.
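As a rough illustration of that in-memory simulation (hypothetical numbers; it treats a whole node's load as a single movable unit, which real pods are not, and assumes the scheduler re-packs evicted pods onto the chosen target, e.g. via NodeResourcesFit with MostAllocated):

package main

import (
	"fmt"
	"sort"
)

func main() {
	// Hypothetical per-node utilization, in percent of allocatable.
	util := []int{10, 20, 30, 40}

	for {
		sort.Ints(util) // ascending: least-utilized nodes first

		// Pick the least-utilized node that still has pods to evict.
		src := 0
		for src < len(util) && util[src] == 0 {
			src++
		}
		if src >= len(util)-1 {
			break // at most one non-empty node left; nothing to bin-pack
		}

		// Pick the most-utilized node that can still absorb the source's load.
		dst := len(util) - 1
		for dst > src && util[dst]+util[src] > 100 {
			dst--
		}
		if dst == src {
			break // no target has enough headroom for the whole source node
		}

		// Simulate the evictions and assume the scheduler re-packs the
		// evicted pods onto the target node.
		util[dst] += util[src]
		util[src] = 0
		fmt.Println(util)
	}
}

Starting from 10/20/30/40%, this prints [0 20 30 50], [0 0 30 70], [0 0 0 100]: it keeps evicting until no underutilized source node can fit onto a remaining target.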
I also would not recommend it as a method to bypass a proper scheduler configuration.
Heh, yeah, sounds like playing with fire for sure.
Would be great to have #668 resolved and docs updated on how to go about the correct scheduler configuration.
@slobo yeah good point, bumped that issue because someone had expressed interest in updating the docs (though sometimes these things get buried)
Questions to think about:
- If all nodes are underutilized, how can one decide which nodes should be completely drained and which should stay as potential target nodes for attracting evicted pods?
- "Maybe we could sort the nodes by utilization and evict pods from the least-utilized nodes until the higher nodes become full (or at least we assume they'll be full)" - I wonder how many least-utilized nodes we should evict from, and how many higher nodes we should target (taking into account not just the case where all the nodes are underutilized)?
The threshold was introduced as a trivial solution for dividing nodes into two groups: one group of nodes to evict pods from, the other as potential targets for co-locating pods to achieve the bin-packing (it is up to the kube-scheduler to decide which nodes are actually targeted). In this case the threshold is too high to evict pods from any node. So you need an adaptive approach which lowers the threshold. Something like "if all nodes are underutilized, find the first X% of nodes which have the lowest utilization (wrt. native and extended resources) and dynamically adjust the threshold". The X can be set to 50% or higher based on customer use cases. E.g.:
deschedulerPolicy:
  strategies:
    HighNodeUtilization:
      enabled: true
      params:
        nodeResourceUtilizationThresholds:
          thresholds:
            memory: 20
          adaptiveThresholdPercentages: 50 # will apply only when all nodes are underutilized
          numberOfNodes: 0
    LowNodeUtilization:
      enabled: false
schedule: 5 10 * * *
The strategy would then "just" adapt the threshold and run the strategy one more time with the new threshold. With expectation that next time there's gonna be at least one node that is minimally utilized (on the scale [underutilized, utilized, overutilized]).
@slobo I wonder if dynamically adapting the threshold would be something applicable in your use case(s)?
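A sketch of what that adaptation could look like (adaptiveThresholdPercentages above is only a proposal, and this percentile math is just one possible reading of it, not descheduler code): when every node is below the configured threshold, lower the threshold to the utilization of the node at the configured percentile, so the lowest X% of nodes become eviction candidates.

package main

import (
	"fmt"
	"sort"
)

// adaptThreshold returns a new threshold when every node is below the
// configured one: the utilization of the node at the given percentile,
// so roughly the lowest `percent` of nodes stay below the new threshold.
func adaptThreshold(memUtil []float64, configured, percent float64) float64 {
	for _, u := range memUtil {
		if u >= configured {
			return configured // at least one node is not underutilized; keep as is
		}
	}
	sorted := append([]float64(nil), memUtil...)
	sort.Float64s(sorted)
	idx := int(float64(len(sorted)) * percent / 100) // node at the X-th percentile
	if idx >= len(sorted) {
		idx = len(sorted) - 1
	}
	return sorted[idx]
}

func main() {
	memUtil := []float64{8, 12, 15, 17} // percent memory utilization per node
	// Configured thresholds.memory is 20, adaptiveThresholdPercentages is 50:
	// every node is below 20%, so lower the threshold to the median node's usage.
	fmt.Println(adaptThreshold(memUtil, 20, 50)) // prints 15; nodes at 8% and 12% become sources
}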
I wonder how many least-utilized nodes we should evict from, and how many higher nodes we should target (taking into account not just the case where all the nodes are underutilized)?
If the goal of HighNodeUtilization is to bin-pack, then imo the answer is "all the nodes". For example if you have 4 nodes with these utilizations:
10% 20% 30% 40%
Then evicting from the lowest onto the highest would follow these steps:
1. 0% 20% 30% 50%
2. 0% 0% 30% 70%
3. 0% 0% 0% 100%
Since the scheduler doesn't have any configurable threshold for "fully-packed" (i.e., it always tries for 100% utilization when configured this way), staying in sync with the assumed re-scheduling strategy means working from the assumption that we want as few nodes as possible at 100% and as many as possible at 0%.
I wonder what the ratio of fully evictable nodes to "fully-packed" nodes is in the wild. Given the descheduler cannot simulate the optimal number of nodes to leave intact, we still need some artificial threshold/way of saying "this is where we stop evicting so we don't over-flood the remaining nodes". The current algorithm is too simple and presumably too sub-optimal to take that assumption into account.
I wonder if it is worth extending the current implementation towards the assumption or re-implementing it (maybe as a new strategy replacing this one) to evict pods from the lowest to the highest utilized nodes. I have not explored this path yet. Open to discussion.
I also would not recommend it as a method to bypass a proper scheduler configuration
My use case is that nodes were properly provisioned at the time of scale-up, but maybe some pods were from jobs or something, leaving the node over-provisioned (underutilized). Thus I want the descheduler to fix an issue that would otherwise go untouched.
It seems we need to take a big-picture look at how to improve the descheduler for this part.
I also would not recommend it as a method to bypass a proper scheduler configuration
My use case is that nodes were properly provisioned at the time of scale-up, but maybe some pods were from jobs or something, leaving the node over-provisioned (underutilized). Thus I want the descheduler to fix an issue that would otherwise go untouched.
That's a good point, and an interesting use case. You're talking about draining certain nodes to deprovision them but still keeping an even spread, right?
Have you thought about something like tainting the node, then running the descheduler with RemovePodsViolatingNodeTaints to drain it?
Have you thought about something like tainting the node, then running the descheduler with RemovePodsViolatingNodeTaints to drain it?
Absolutely thought about that. That's why I'm here. 😂 Give me automation, or give me death.
I was hoping that the descheduler could do exactly this operation: taint/cordon nodes based on utilization and then drain them.
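For reference, the manual workflow suggested above would look roughly like this with the helm-style values used earlier in this issue (the node name and taint key are placeholders; pods on the tainted node that lack a matching toleration become eviction candidates):

# kubectl taint nodes <node-name> example.com/drain=true:NoSchedule
deschedulerPolicy:
  strategies:
    RemovePodsViolatingNodeTaints:
      enabled: true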
+1 for the issue. Thanks
@JaneLiuL sure, feel free to start some work on it. I think we might need to think about how we'll sort the nodes, since there are multiple criteria for underutilization (memory, cpu, pod). Maybe the average of them and sort that way?
@JaneLiuL are you working on this issue? If not I can take this up.
@dineshbhor sorry, I forgot about this issue; you can take it :)
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
So has this been abandoned? I wanted to use the descheduler to achieve bin-packing, but it fails with the exact same issue the OP presented.
@ppawlowski not abandoned, just not re-opened.
Any updates on this? Just ran into this issue myself when I noticed one of our clusters was not packed as well as our others.
https://github.com/kubernetes-sigs/descheduler/pull/893 was closed. I am not aware of any other effort in this direction. Unless proven otherwise, this is up for the taking.
Is there any update on this question? We recently encountered the same problem when packing GPU cards. Each node has 8 GPU cards, and most nodes have 4 GPU cards occupied. We hope to improve the utilization of GPU cards through the HighNodeUtilization strategy.
Encountering the same issue. We would like jobs to bin-pack onto nodes using the HighNodeUtilization strategy.