
Preferred scheduling

Open segator opened this issue 5 years ago • 26 comments

It would be nice if we had a soft check like "preferredDuringSchedulingIgnoredDuringExecution" that calculates whether there is a node with a better weight than the node where the pod is currently running, and if so evicts the pod so it can be scheduled on the better node.

For example, because of network routes I prefer my Deployment pods to run in ZoneA, but if that is not possible because the nodes there are offline, then I accept being deployed in ZoneB. However, if ZoneA becomes reachable again, I want the pods rescheduled back to ZoneA.
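A minimal sketch of what that preference could look like on the Deployment, assuming the nodes carry the standard topology.kubernetes.io/zone label (the zone values and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100                  # strongly prefer ZoneA when it has schedulable nodes
            preference:
              matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["zone-a"]
          - weight: 10                   # acceptable fallback: ZoneB
            preference:
              matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["zone-b"]
      containers:
      - name: app
        image: registry.example.com/my-app:latest
```

The ask is for the descheduler to notice when the weight-100 preference becomes satisfiable again and evict the pod so kube-scheduler can move it back to ZoneA.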

segator avatar Jan 03 '20 19:01 segator

/kind feature

seanmalloy avatar Feb 06 '20 03:02 seanmalloy

Apart from preferredDuringSchedulingIgnoredDuringExecution, it would be nice if soft taints like PreferNoSchedule could also be taken into account.

https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
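For reference, a minimal sketch of such a soft taint on a node spec (node name, key, and value are placeholders); the idea would be for the descheduler to also consider moving pods off nodes they only landed on because the taint is merely a preference:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node-1                # placeholder node name
spec:
  taints:
  - key: dedicated            # placeholder key/value
    value: special-workload
    effect: PreferNoSchedule  # soft: the scheduler tries to avoid this node but may still use it
```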

nathan-vp avatar Feb 27 '20 19:02 nathan-vp

I tried to work on this but found the feature difficult to implement. I think it is worth explaining the fundamental difficulty for anyone following this issue.

It is indeed possible to detect that a pod has a more preferable node, in terms of the sum of affinity weights. However, even if we evict the pod to let kube-scheduler place it on another node with a higher affinity score, it sometimes ends up on the same node as before. This is because scheduling is decided not only by node affinity or inter-pod affinity. (https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/#scoring)

So unless the descheduler can make the same decision as kube-scheduler, it can cause this kind of ineffective pod evictions. I have no good idea for overcoming this difficulty; copying all of kube-scheduler's scheduling policies into the descheduler is not realistic.

As a Kubernetes user, I decided to always use requiredDuringSchedulingIgnoredDuringExecution. That brought other issues, but they were resolvable in my case.

asnkh avatar Mar 21 '20 10:03 asnkh

I am looking for the same behavior, although for a different use case: I need to "redistribute" pods that use preferredDuringSchedulingIgnoredDuringExecution across the available AZs to ensure high availability at all times, while still allowing scale-out to more pods than the number of AZs (with requiredDuringSchedulingIgnoredDuringExecution the maximum number of pods is the number of AZs the cluster sees).

Unless I am missing a Kubernetes feature that supports exactly that...

barucoh avatar Apr 02 '20 15:04 barucoh

> I am looking for the same behavior, although for a different use case: I need to "redistribute" pods that use preferredDuringSchedulingIgnoredDuringExecution across the available AZs to ensure high availability at all times, while still allowing scale-out to more pods than the number of AZs (with requiredDuringSchedulingIgnoredDuringExecution the maximum number of pods is the number of AZs the cluster sees).
>
> Unless I am missing a Kubernetes feature that supports exactly that...

@barucoh take a look at the topologySpreadConstraints feature. It was promoted to beta and is enabled by default starting with k8s v1.18.
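A minimal sketch of such a constraint, assuming the standard zone label; whenUnsatisfiable: ScheduleAnyway keeps it a soft preference, so you can still scale out to more pods than there are AZs:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway   # soft: allows more pods than zones
        labelSelector:
          matchLabels:
            app: my-app
      containers:
      - name: app
        image: registry.example.com/my-app:latest
```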

seanmalloy avatar Apr 03 '20 02:04 seanmalloy

This is taken from my other comment on the closed issue referenced above, describing my use case:

The use case is where you have two autoscaling groups of nodes, one using spot instances and the other standard on-demand. The spot nodes get terminated when the price goes above the threshold, which invalidates the preferredDuringScheduling affinity and results in pods being scheduled onto the standard nodes. Over time the spot price comes back down, and the descheduler could then reschedule those pods back onto the spot instances.

The issue referenced above does have a PR against it with a potential implementation. However, @asnkh makes an awkward point: in some cases this could result in "flapping" redeploys, though maybe that can be dealt with by additional affinities.

yoda avatar Apr 29 '20 05:04 yoda

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Jul 28 '20 05:07 fejta-bot

/remove-lifecycle stale

seanmalloy avatar Jul 30 '20 13:07 seanmalloy

There is some discussion about giving the scheduler a dry-run capability, which could mean that @asnkh's approach wouldn't require importing all the scheduling logic into this project. Alas, that issue says the current way to "check capacity" is the cluster-capacity tool.

  • https://github.com/kubernetes/kubernetes/issues/58242 <-- kube-scheduler dry-run request
  • https://github.com/kubernetes-sigs/cluster-capacity <-- cluster-capacity tool to check where a pod "might be" scheduled

kesor avatar Sep 23 '20 13:09 kesor

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Dec 22 '20 14:12 fejta-bot

/remove-lifecycle stale

seanmalloy avatar Dec 22 '20 15:12 seanmalloy

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar Mar 22 '21 16:03 fejta-bot

/remove-lifecycle stale

seanmalloy avatar Mar 23 '21 04:03 seanmalloy

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar Jun 21 '21 05:06 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten

fejta-bot avatar Jul 21 '21 05:07 fejta-bot

If someone still wants this functionality in a simple, quick-and-dirty way, please check out https://github.com/decayofmind/kube-better-node

decayofmind avatar Jul 29 '21 11:07 decayofmind

/remove-lifecycle rotten

seanmalloy avatar Aug 20 '21 06:08 seanmalloy

@decayofmind Looks really great! I also have a similar use case. We have spot autoscaling groups and I need to be sure that my pods can always be scheduled.

So I cannot use required node affinity, because my spot ASG can be scaled down to 0 at any time. I want the pod to be rescheduled once the preferred node affinity can be satisfied again.

sergeyshevch avatar Nov 15 '21 14:11 sergeyshevch

We also have a similar use case, where we want the descheduler to work with preferredDuringSchedulingIgnoredDuringExecution, because if we use "required" then the number of pods we can spawn is limited to the number of nodes in a single-zone cluster, or to the number of zones in a multi-zonal cluster.

rajivml avatar Dec 02 '21 03:12 rajivml

@rajivml have you looked at topology spread constraints? Your situation sounds similar to one that was mentioned above https://github.com/kubernetes-sigs/descheduler/issues/211#issuecomment-608200422

damemi avatar Dec 02 '21 13:12 damemi

I have the exact same use case as @sergeyshevch, and I don't know of any other way to address this issue.

I'd like to add that a 100% solution isn't needed for this; I'd be happy with 80% as well, and I don't mind that

> it can cause this kind of ineffective pod evictions

as @asnkh mentioned earlier. Especially because the RemoveDuplicates strategy has the same restriction, no? It can also lead to ineffective pod evictions, but it's still a useful strategy to have.

As long as the evictions respect PDBs I don't mind them at all.
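For context, RemoveDuplicates is just a policy entry like the sketch below (v1alpha1 policy format), and because the descheduler evicts through the eviction API, PDBs are honored:

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true
```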

dvdvorle avatar Jan 27 '22 12:01 dvdvorle

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 27 '22 13:04 k8s-triage-robot

/remove-lifecycle stale

sergeyshevch avatar May 06 '22 15:05 sergeyshevch

> It is indeed possible to detect that a pod has a more preferable node, in terms of the sum of affinity weights. However, even if we evict the pod to let kube-scheduler place it on another node with a higher affinity score, it sometimes ends up on the same node as before. This is because scheduling is decided not only by node affinity or inter-pod affinity. (https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/#scoring)
>
> So unless the descheduler can make the same decision as kube-scheduler, it can cause this kind of ineffective pod evictions. I have no good idea for overcoming this difficulty; copying all of kube-scheduler's scheduling policies into the descheduler is not realistic.

As I understand it, the problem here is that this could result in "flapping" of pods. For some of us, that could be acceptable.

We don't currently have a good solution for making the same decision as kube-scheduler (e.g. the cluster-capacity tool, a dry run, etc.), so what about accepting the limitation and reducing the impact of the flapping? For example, if I could say "don't deschedule a pod that is less than 30 minutes old", the pod could flap at most once every 30 minutes.

The hope would be that, in a dynamic cluster, the pod would eventually be moved as desired. Worst case, you get a flap periodically, and you would understand and accept that if you choose to use the feature.
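A sketch of how such a guard might look in a policy; both the strategy name and minPodAgeSeconds below are hypothetical (not existing descheduler fields), purely to illustrate the idea:

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingPreferredNodeAffinity":   # hypothetical strategy
    enabled: true
    params:
      minPodAgeSeconds: 1800                    # hypothetical knob: never evict pods younger than 30 minutes
```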

shaneqld avatar Jun 18 '22 01:06 shaneqld

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 16 '22 02:09 k8s-triage-robot

/remove-lifecycle stale

@seanmalloy I guess such a use case can be implemented later, and we should freeze this issue so the discussion and future implementations can continue.

Can you freeze it?

sergeyshevch avatar Sep 16 '22 07:09 sergeyshevch

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 15 '22 07:12 k8s-triage-robot

/remove-lifecycle stale

z0rc avatar Dec 15 '22 08:12 z0rc

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Mar 15 '23 09:03 k8s-triage-robot

/remove-lifecycle stale

z0rc avatar Mar 15 '23 10:03 z0rc