descheduler
Descheduler seems to ignore cordoned nodes when nodeFit is enabled
What version of descheduler are you using?
descheduler version: v0.24.1, installed via the Helm chart
Does this issue reproduce with the latest release? yes
Please provide a copy of your descheduler policy config file
Helm values.yaml
kind: Deployment
deschedulerPolicy:
  strategies:
    RemovePodsViolatingTopologySpreadConstraint:
      enabled: true
      params:
        includeSoftConstraints: true
        nodeFit: true
    RemoveDuplicates:
      enabled: false
    RemovePodsViolatingNodeTaints:
      enabled: false
    RemovePodsViolatingNodeAffinity:
      enabled: false
    RemovePodsViolatingInterPodAntiAffinity:
      enabled: false
    LowNodeUtilization:
      enabled: false
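For context, with the chart's default templating these values should render to roughly the following policy (a sketch assuming the v1alpha1 DeschedulerPolicy API used by v0.24):

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingTopologySpreadConstraint":
    enabled: true
    params:
      includeSoftConstraints: true
      nodeFit: true
```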
What k8s version are you using (kubectl version)?
kubectl version output:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.1", GitCommit:"3ddd0f45aa91e2f30c70734b175631bec5b5825a", GitTreeState:"archive", BuildDate:"2022-05-27T18:33:09Z", GoVersion:"go1.18.2", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"816c97ab8cff8a1c72eccca1026f7820e93e0d25", GitTreeState:"clean", BuildDate:"2022-01-25T21:19:12Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"linux/amd64"}
What did you do?
- Created a cluster with minikube (5 nodes)
- Set the first node as unschedulable with the taint node-role.kubernetes.io/master:NoSchedule
- Labeled nodes 2 and 3 with zone=primary and nodes 4 and 5 with zone=backup
- Created an nginx deployment with topologySpreadConstraints on zone with maxSkew: 1 (yaml in attachment; a sketch is shown after this list)
- Cordoned and drained nodes 4 and 5
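The attached deployment manifest is not reproduced here; a minimal sketch of a constraint matching the description (the app label, replica count, and whenUnsatisfiable value are assumptions) looks like:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 10                 # assumed for illustration
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: zone                  # the node label set above
          whenUnsatisfiable: ScheduleAnyway  # assumed soft constraint; includeSoftConstraints is enabled
          labelSelector:
            matchLabels:
              app: nginx
      containers:
        - name: nginx
          image: nginx
```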
What did you expect to see? Pods drain entirely to the zone=primary nodes and stay there until I uncordon nodes 4 and 5, after which the descheduler does its thing and moves half of them back to zone=backup.
What did you see instead?
Pods drain to the zone=primary nodes; however, the descheduler does not seem to respect that all zone=backup nodes are unschedulable, and it keeps evicting half of the pods on the zone=primary nodes forever.
Attached as nodes.yaml is the output of kubectl get nodes -o yaml: Archive.zip
Mentioned in Slack, but I think the solution may be to either:
- change the set of nodes passed to the topology spread strategy, or
- add logic to the strategy to ignore tainted nodes in its domain calculation

Right now, the strategy just operates on the default set of ReadyNodes, which includes nodes that are tainted with NoSchedule. So even though pods can't be rebalanced onto those nodes, the strategy still includes them in its calculation of imbalanced domains.

imo, this strategy simply shouldn't care about tainted nodes. If a node is tainted with NoSchedule, there is no point in trying to balance it as part of the domain.

The argument could be made that the NoSchedule nodes might be oversized and thus need pods to be evicted, but that is more a job for the RemovePodsViolatingNodeTaints strategy, so imo that point is moot (a sketch of that configuration is shown below).
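For completeness, if evicting pods off NoSchedule-tainted nodes were actually the goal, a sketch of Helm values enabling that strategy (same chart layout as the values above; the params are assumptions) would be:

```yaml
deschedulerPolicy:
  strategies:
    RemovePodsViolatingNodeTaints:
      enabled: true
      params:
        nodeFit: true   # only evict pods that fit on another schedulable node
```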
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.