Preferred scheduling
It would be nice to have a soft check, analogous to preferredDuringSchedulingIgnoredDuringExecution, that detects when there is a node with a better weight than the node where a pod is currently running, and evicts the pod so it can be scheduled onto that better node.
For example, because of network routes I prefer my deployment's pods to run in ZoneA, but if that is not possible because the nodes there are offline, I accept being deployed in ZoneB. Once ZoneA is reachable again, I want the pods rescheduled back to ZoneA.
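A minimal sketch of what the zone preference described above could look like on a pod spec (zone names, weights, and the pod itself are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: zone-preference-example   # illustrative name
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      # strongly prefer ZoneA
      - weight: 100
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - zone-a
      # fall back to ZoneB when ZoneA has no schedulable nodes
      - weight: 10
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - zone-b
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9   # placeholder image
```

The scheduler only honors these weights at scheduling time; the request is for the descheduler to notice when a higher-weight node becomes feasible again and evict the pod so it can move back.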
/kind feature
Apart from preferredDuringSchedulingIgnoredDuringExecution, it would be nice if soft taints like PreferNoSchedule could also be taken into account.
https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
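A PreferNoSchedule taint is the soft counterpart of NoSchedule; a small sketch of what such a taint looks like on a node (node name, key, and value are illustrative):

```yaml
# Equivalent to: kubectl taint nodes worker-1 dedicated=batch:PreferNoSchedule
apiVersion: v1
kind: Node
metadata:
  name: worker-1              # illustrative node name
spec:
  taints:
  - key: dedicated            # illustrative key/value pair
    value: batch
    effect: PreferNoSchedule  # scheduler tries to avoid this node, but may still place pods here
```

The idea would be for the descheduler to also consider evicting pods that ended up on such softly tainted nodes once an untainted node is available.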
I tried to work on this but found the feature difficult to implement. I believe it is worthwhile for anyone interested in this issue to understand the fundamental difficulty.
It is indeed possible to detect that a pod has a more preferable node, in terms of the sum of affinity weights. However, even if we evict the pod so that kube-scheduler can place it on a node with a higher affinity score, it sometimes ends up on the same node as before, because scheduling is decided by more than node affinity or inter-pod affinity (https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/#scoring).
So unless the descheduler can make the same decision as kube-scheduler, it can cause this kind of ineffective pod eviction. I have no good idea how to overcome this difficulty; copying all of kube-scheduler's scheduling policies into the descheduler is not realistic.
As a user of Kubernetes, I decided to always use requiredDuringSchedulingIgnoredDuringExecution. Doing so brought other issues, but they were resolvable in my case.
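For contrast, a minimal sketch of that hard variant; a pod with this affinity simply stays Pending when no node in zone-a is feasible (label key and zone are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: required-affinity-example   # illustrative name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - zone-a
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9   # placeholder image
```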
I am looking for the same behavior, although for a different use case: I need to "redistribute" pods that use preferredDuringSchedulingIgnoredDuringExecution across the available AZs to ensure high availability at all times, while still allowing scale-out to more pods than the number of AZs in the cluster (requiredDuringSchedulingIgnoredDuringExecution caps the number of pods at the number of AZs the cluster sees).
Unless I am missing a Kubernetes feature that supports exactly that...
@barucoh take a look at the topologySpreadConstraints feature. It was promoted to beta and is enabled by default starting with k8s v1.18.
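A rough sketch of such a soft spread constraint, distributing pods with the same label across zones without blocking scheduling when the constraint cannot be met (labels and names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spread-example
  labels:
    app: spread-example                        # must match the labelSelector below
spec:
  topologySpreadConstraints:
  - maxSkew: 1                                 # allow at most one pod of imbalance between zones
    topologyKey: topology.kubernetes.io/zone   # spread across zones
    whenUnsatisfiable: ScheduleAnyway          # soft: schedule anyway if the constraint cannot be met
    labelSelector:
      matchLabels:
        app: spread-example
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9           # placeholder image
```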
This is taken from my other comment on the closed issue above, describing my use case:
The use case is where you have two autoscaling groups of nodes, one using spot instances and the other standard on-demand instances. The spot nodes get terminated when the price goes above a threshold, invalidating the preferredDuringScheduling affinity and causing pods to be scheduled onto standard nodes. Over time the spot price comes back down, and the descheduler could trigger rescheduling of those pods back onto spot instances.
The issue referenced above does have a PR against it with a potential implementation, but @asnkh makes a fair point that in some cases this could result in "flapping" redeployments; maybe that can be dealt with by additional affinities.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
There is some discussion about giving the scheduler a dry-run capability, which could mean that @asnkh's approach wouldn't require importing all the scheduling logic into this project. Alas, that issue says that at the moment the way to "check capacity" is the cluster-capacity tool.
https://github.com/kubernetes/kubernetes/issues/58242 <-- kube-scheduler dry-run request
https://github.com/kubernetes-sigs/cluster-capacity <-- cluster-capacity tool to check where a pod "might be" scheduled
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
If someone still wants this functionality in a simple, quick-and-dirty way, please check out https://github.com/decayofmind/kube-better-node
/remove-lifecycle rotten
@decayofmind Looks really great! I also have a similar use case. We have spot autoscaling groups, and I need to be sure that my pods can always be scheduled.
So I cannot use a required node affinity, because my spot ASG may be scaled down to 0, and I want the pod to be rescheduled once the preferred node affinity can be satisfied again.
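A sketch of the affinity this describes: prefer spot nodes but fall back to on-demand capacity when the spot group is scaled to zero (the node label here is an assumed custom label, not a standard one):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: prefer-spot-example   # illustrative name
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: node-lifecycle   # assumed custom label applied to spot nodes
            operator: In
            values:
            - spot
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9   # placeholder image
```

With only the soft preference, pods that landed on on-demand nodes during a spot shortage stay there until something, such as the strategy requested here, evicts them.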
We also have a similar use case where we want the descheduler to work with preferredDuringSchedulingIgnoredDuringExecution, because if we use "required" then the number of pods we can span is limited to the number of nodes in a single-zone cluster, or to the number of zones in a multi-zonal cluster.
@rajivml have you looked at topology spread constraints? Your situation sounds similar to one that was mentioned above https://github.com/kubernetes-sigs/descheduler/issues/211#issuecomment-608200422
I have the exact same use case as @sergeyshevch, and I don't know of any other way to address this issue.
I'd like to add that a 100% solution isn't needed for this; I'd be happy with 80% as well, and I don't mind that "it can cause this kind of ineffective pod evictions", as @asnkh mentioned earlier. Especially because the RemoveDuplicates strategy has the same restriction, no? It can also lead to ineffective pod evictions, but it's still a useful strategy to have.
As long as the evictions respect PDBs, I don't mind them at all.
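For comparison, this is roughly what the existing strategies look like in the descheduler policy; a preferred-affinity mode would presumably slot into the same nodeAffinityType list (exact fields may differ between descheduler versions):

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true
  "RemovePodsViolatingNodeAffinity":
    enabled: true
    params:
      nodeAffinityType:
      - "requiredDuringSchedulingIgnoredDuringExecution"
      # the feature requested in this issue would add something like:
      # - "preferredDuringSchedulingIgnoredDuringExecution"
```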
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
As I understand it, the problem @asnkh describes above is that this could result in "flapping" of pods. For some of us, that would be acceptable.
We don't currently have a good way of making the same decision as kube-scheduler (e.g. the cluster-capacity tool, a dry run, etc.), so what about accepting the limitation and reducing the impact of the flapping? For example, if I could say "don't deschedule a pod that is less than 30 minutes old", the pod could flap at most once every 30 minutes.
The hope would be that, in a dynamic cluster, the pod would eventually be moved as desired. Worst case, you get a periodic flap, which you understand and accept if you opt into the feature.
@seanmalloy I guess such a use case can be implemented later; we should freeze this issue so the discussion and future implementations can continue. Can you freeze it?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale