descheduler
nodeFit = false doesn't work as expected with RemovePodsViolatingNodeAffinity
What version of descheduler are you using?
descheduler version: 0.22
Does this issue reproduce with the latest release? Yes
Which descheduler CLI options are you using?
- "--policy-config-file"
- "/policy-dir/policy.yaml"
- "--descheduling-interval"
- "30s"
- "--v"
- "4"
Please provide a copy of your descheduler policy config file
apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy-configmap
  namespace: kube-system
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      "RemovePodsViolatingNodeAffinity":
        enabled: true
        params:
          nodeAffinityType:
          - "requiredDuringSchedulingIgnoredDuringExecution"
          labelSelector:
            matchExpressions:
            - {key: "foo", operator: In, values: [bar]}
          nodeFit: false
What k8s version are you using (kubectl version)?
kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.9", GitCommit:"7a576bc3935a6b555e33346fd73ad77c925e9e4a", GitTreeState:"clean", BuildDate:"2021-07-15T21:01:38Z", GoVersion:"go1.15.14", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.9", GitCommit:"7a576bc3935a6b555e33346fd73ad77c925e9e4a", GitTreeState:"clean", BuildDate:"2021-07-15T20:56:38Z", GoVersion:"go1.15.14", Compiler:"gc", Platform:"linux/amd64"}
What did you do?
I created a deployment with a node selector forcing the pods to land on a specific single node. The deployment pod template had a node affinity rule, which was not triggered during scheduling. After scheduling, I changed the node labels so that the affinity should trigger descheduling, and labeled the pods with the required foo=bar label.
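For illustration only, here is a minimal sketch of such a deployment; the actual manifest was not part of the report, so the node name, affinity key, and image below are hypothetical, while the foo=bar label matches the labelSelector in the policy above:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: affinity-demo            # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: affinity-demo
  template:
    metadata:
      labels:
        app: affinity-demo
        foo: bar                 # matches the labelSelector in the descheduler policy
    spec:
      nodeSelector:
        kubernetes.io/hostname: worker-1   # hypothetical: pins the pod to a single node
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: example.com/allowed   # hypothetical label; present at scheduling time,
                operator: In               # removed afterwards so the affinity is violated
                values: ["true"]
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9

Removing the example.com/allowed label from the node afterwards violates the required node affinity, and the hostname nodeSelector ensures the pod cannot fit anywhere else.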
What did you expect to see? I expected the descheduler to evict my pod, and the pod to end up in PENDING state as there is no node where it could fit.
What did you see instead? The descheduler prints that the pod will not fit on the node, but it does not evict it.
The culprit is in the strategy code, at node_affinity.go line 89: https://github.com/kubernetes-sigs/descheduler/blob/5b557941fac60cd85c1a3b709eb49a08d648fd8d/pkg/descheduler/strategies/node_affinity.go#L89
The node affinity strategy always checks whether the pod fits on another node, regardless of how nodeFit is set. The documentation of the nodeFit filtering suggests that the default is not to perform this check: https://github.com/kubernetes-sigs/descheduler/blob/5b557941fac60cd85c1a3b709eb49a08d648fd8d/README.md#L703-L707
It would be logical that, when nodeFit=false, the pod is evicted regardless of whether it fits somewhere or not. A one-line change would fix this; line 89 could, for example, be changed to something like:
(!nodeFit || nodeutil.PodFitsAnyNode(pod, nodes))
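To make the intended semantics concrete, here is a minimal, self-contained Go sketch of that guard; podFitsAnyNode and shouldEvict are illustrative stand-ins, not the real descheduler functions:

package main

import "fmt"

// podFitsAnyNode is a stand-in for nodeutil.PodFitsAnyNode; here it simply
// returns the flag it is given, for illustration only.
func podFitsAnyNode(fitsSomewhere bool) bool { return fitsSomewhere }

// shouldEvict sketches the proposed guard: with nodeFit=false the fit check is
// skipped entirely; with nodeFit=true the pod must fit on some node.
func shouldEvict(nodeFit, fitsSomewhere bool) bool {
	return !nodeFit || podFitsAnyNode(fitsSomewhere)
}

func main() {
	fmt.Println(shouldEvict(false, false)) // true: evict even though no node fits
	fmt.Println(shouldEvict(true, false))  // false: keep the pod, no node fits
	fmt.Println(shouldEvict(true, true))   // true: evict, another node fits
}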
/cc @RyanDevlin (if you have the time) The description seems reasonable to me at first glance; do you have any opposition? You know the NodeFit code best.
@damemi Sure, I can take a look. If I had to guess, this is because not all strategies are aligned to the same NodeFit considerations. PR #636 should fix this, but I just noticed something I forgot to change. Let me report back here in a bit once I have a chance to test something.
/assign
@damemi @uniemimu Reporting back... this is now fully fixed in #636.
@RyanDevlin I verified that your patches fix this
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Since @RyanDevlin is busy and #636 isn't moving overly fast, may I propose an alternative approach for making sure this nearly six-month-old bug gets fixed:
- I push my one-liner fix into a PR, and it is merged by e.g. @damemi
- Ryan later rebases on top of it, and makes my one-liner go away. I don't mind, and that rebase isn't going to be difficult.
Hi @uniemimu, I'm very sorry about the delay. Kubernetes and its subprojects (like this one) are maintained by a significant number of volunteers, many of whom review issues and PRs outside of their day jobs. This can lead to some issues being unfortunately neglected for some time. It's always okay in this case to kindly ping someone to check on the status of an issue that is affecting you; otherwise some things (like this one) can get buried under the many other topics that come up. Thank you for understanding this.
Regarding your suggestion, this change may be more complicated than the one-liner you proposed. While it would resolve this issue, it would also change the current default behavior of the NodeAffinity strategy. Since this strategy has been implemented to not evict pods that don't fit on another node by default, users may be operating under that assumption.
This actually brings up a good point for #636, since we default NodeFit to false (in the same idea of preserving default behavior, though with a feature-level implementation like this we can give appropriate notice to users to eventually make this the new default). Defaulting to false in this case would add the same change, by conflating the NodeSelector fit check with the others provided by that feature. I believe this is why the patch worked to solve your problem.
tl;dr is that this change, while technically simple, would break existing user workflows for different use cases that assume the current default. Additionally, I think what you have pointed out raises a bigger concern for the changes in #636 which might also break the default behavior in this strategy (perhaps others too?) @RyanDevlin does this seem right with what you were working on?
@damemi I took a look at the legacy here:
https://github.com/kubernetes-sigs/descheduler/commit/40bb490f4cd2f5b1ee218e03c25c6df223ae8313#diff-92b554150104f06be4ac18898266bad157dc2f611a6268d873765bf9608aadb3R56-R60
And indeed you are correct: the legacy behavior fits NodeFit being true better, considering how things have worked since this particular strategy was introduced.
All the other strategies of that era, and to my knowledge also the current ones, have worked as if NodeFit defaulted to false: none of them checked whether the pod fits any other node. The difference between the strategies is somewhat unfortunate when trying to pick one default value for the field that covers all strategies. I'm no expert on the matter, but the API defaults.go and zz_generated.defaults.go are rather empty, which leaves me assuming the zero value false is what gets defaulted to. Also, all strategies now set a local nodeFit variable to false, after which the value is pulled from the params in a snippet such as this:
https://github.com/kubernetes-sigs/descheduler/blob/3c251fb09db3d58e10c8290d9510e18f73c3a771/pkg/descheduler/strategies/pod_antiaffinity.go#L74-L77
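For readers not following the link, the pattern roughly amounts to the following paraphrased, self-contained sketch (stand-in types, not a verbatim copy of the descheduler source):

package main

import "fmt"

// StrategyParameters is a stand-in for the descheduler's strategy params type;
// only the field relevant to this discussion is shown.
type StrategyParameters struct {
	NodeFit bool
}

// Strategy is a stand-in for a strategy entry in the policy.
type Strategy struct {
	Params *StrategyParameters
}

func main() {
	withoutParams := Strategy{}
	withParams := Strategy{Params: &StrategyParameters{NodeFit: true}}

	for _, s := range []Strategy{withoutParams, withParams} {
		// The pattern described above: default the local variable to false,
		// then read the value from the params only if params are present.
		nodeFit := false
		if s.Params != nil {
			nodeFit = s.Params.NodeFit
		}
		fmt.Println("nodeFit:", nodeFit)
	}
}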
So I suppose it is safe to say the current default value is false; how the NodeAffinity strategy behaves is another matter.
Now, in order to preserve how the original NodeAffinity strategy used to work, how the rest of the strategies have worked, and how the nodeFit feature is documented to work, I would presume the correct fix is to default NodeFit to true for the NodeAffinity strategy and to false for the rest of the strategies, and then potentially remove this check in NodeAffinity:
&& nodeutil.PodFitsAnyNode(pod, nodes)
Since that check is already in the PodEvictor and controlled by NodeFit
If you need more hands in the project, like for reviews, I would gladly volunteer.
@damemi @uniemimu I apologize for my absence, I didn't have time to contribute until this week. I have much more time now to focus on closing this issue and merging my PR, if that's still the desired outcome here.
With regard to the defaults discussion above, #636 is going to be in conflict with some strategies no matter how we default it. This is because, like @damemi pointed out, some strategies have a few of the NodeFit checks baked in, while others perform no checks. If we want to align all strategies under the single NodeFit mechanism, we will need to eventually change the behavior of some of them.
One way to do this more gracefully might be to overlap the strategy and NodeFit checks. Since NodeFit considers node affinity before evicting, and the NodeAffinity strategy also considers this, we could merge the checks in node.go of #636, but leave the NodeAffinity strategy as is. Then in a later PR we can remove the checks in NodeAffinity that overlap with NodeFit. This causes some redundancy and inefficiency when running some strategies with NodeFit = true, but it allows for stable behavior that can be gradually changed.
@RyanDevlin at least I'd like to see this bug sorted one way or the other. I like the nodeFit feature and I'd like to be able to use it with the node affinity as it is documented. As to how the defaults should be done, I suppose that is up to @damemi to decide how it needs to be. My take on the matter is in the previous comment, and I don't think the default should be the same for all strategies unless a decision is made that the previous way of working doesn't matter.
This is a tough one, but it's not just entirely my call 🙂 I agree with @uniemimu that the default doesn't necessarily have to be the same for all strategies. This takes more work on @RyanDevlin's part to refactor his NodeFit changes, but it might be the best way to avoid breaking the existing behavior.
@ingvagabund you've done a lot of the review for NodeFit so far, wdyt?
tl;dr is that NodeFit changes the default affinity checks for the NodeAffinity strategy, because it bakes in an affinity check already. So moving this check to NodeFit will either have to:
- remove the affinity check by default for this strategy (if we default to disabling NodeAffinity), or
- include the other NodeFit checks by default (if we default to enabling NodeAffinity)
I think the 2nd option is better because while it is a change, it leans toward evicting fewer pods from existing setups rather than more. Thoughts?
@damemi I agree that the least disruptive way to add this feature would be to enable it by default. This way (like you said) the worst-case scenario would involve evicting fewer pods than in a previous version.
I still have some cleanup in my PR to meet the changes requested by @ingvagabund, but I'm going to hold off until we agree on a direction here. If we think this is the way to go I will make that PR ready to merge.
@damemi @ingvagabund this seems to have been left without a decision and thus Ryan hasn't been able to work on this.
@RyanDevlin are you still active on those patches of yours or should I just implement a PR for this bug based on e.g. what Mike said on Feb 23 as his preference?
The 2nd option is preferable
@RyanDevlin very sorry for the delay. I completely missed the last comments. I have rebased your PR and addressed the review comments in https://github.com/kubernetes-sigs/descheduler/pull/790. Feel free to cherry-pick the changes if you still want to finish your PR.
The 2nd option is preferable
@ingvagabund please review #793. And perhaps somebody could trigger ok-to-test?
/lifecycle stale
Current state:
- RemovePodsViolatingNodeAffinity does not fully respect the NodeFit enabled/disabled functionality, because in the past there was no shared concept of checking whether a node fits a replacement for an evicted pod.
- RemovePodsViolatingNodeAffinity has additional PodFitsCurrentNode and PodFitsAnyNode checks before the actual eviction which are hard-coded and cannot be disabled.
- The requested functionality is to allow eviction of a pod even though there is no node feasible for scheduling its replacement.
Framework current direction:
Once the evictor filter gets migrated into a DefaultEvictor plugin, one of the goals is to split it into filter and preEvictionFilter extension points. The filter extension point will correspond to the current evictor filter, just without the current NodeFit function. The NodeFit function will be turned into a preEvictionFilter extension point with its own arguments. Depending on the discussion we will have during the review, we can identify arguments like:
- minimalNodeFit: apply only the PodMatchNodeSelector, TolerationsTolerateTaints and IsNodeUnschedulable checks
- resourceFit: apply fitsRequest (when disabled, a new pending pod can trigger a preemption)
- currentNodeFit: when set to false, exclude the pod's current node from the node feasibility checks
Implementing the requested functionality of allowing eviction of pods without checking node feasibility will be equivalent to disabling the preEvictionFilter extension point of the DefaultEvictor plugin, effectively bypassing any non-essential checks. It is open to debate whether we still need to run the filter extension point (or its essential subset) right before the actual eviction is performed, to make sure a plugin does not accidentally evict a system-critical pod (unless overruled).
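As a rough illustration only (the exact v1alpha2 extension-point names were still under review at this point, so this is an assumption rather than the final API), disabling the preEvictionFilter of the DefaultEvictor for a profile might look something like:

apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
profiles:
- name: RemovePodsViolatingNodeAffinity
  plugins:
    deschedule:
      enabled:
      - RemovePodsViolatingNodeAffinity
    preEvictionFilter:        # assumed extension-point name, per the proposal above
      disabled:
      - DefaultEvictor        # skip the node-feasibility checks before eviction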
For backward compatibility, every v1alpha1 strategy configuration will be converted into a separate v1alpha2 profile internally. In the case of the RemovePodsViolatingNodeAffinity strategy/plugin, the backward compatible setting will correspond to:
- when NodeFit=false (PodFitsAnyNode + !PodFitsCurrentNode) or NodeFit=true (PodFitsAnyOtherNode + PodFitsAnyNode + !PodFitsCurrentNode = PodFitsAnyNode + !PodFitsCurrentNode) (please double check :)):
apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
profiles:
- name: RemovePodsViolatingNodeAffinity
  pluginConfig:
  - name: DefaultEvictor
    args:
      ...
      currentNodeFit: false
  - name: RemovePodsViolatingNodeAffinity
    args:
      ...
  plugins:
    deschedule:
      enabled:
      - RemovePodsViolatingNodeAffinity
In the case of other strategies, the defaults will be:
- for PodLifeTime:

apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
profiles:
- name: PodLifeTime
  pluginConfig:
  - name: DefaultEvictor
    args:
      ...
      currentNodeFit: true # all nodes are feasible for rescheduling
  - name: PodLifeTime
    args:
      ...
  plugins:
    deschedule:
      enabled:
      - PodLifeTime

- for RemoveFailedPods:

apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
profiles:
- name: RemoveFailedPods
  pluginConfig:
  - name: DefaultEvictor
    args:
      ...
      currentNodeFit: true # all nodes are feasible for rescheduling
  - name: RemoveFailedPods
    args:
      ...
  plugins:
    deschedule:
      enabled:
      - RemoveFailedPods
/lifecycle rotten
/remove-lifecycle rotten
I am starting to work on the preEvictionFilter extension point. Do we want to discuss this :point_right: "Depending on the discussion we will have during the review, we can identify arguments like:" in the PR review, or here in this issue? I think it would make sense to make some decision here first, but I guess I could come up with something and check with you all what you think.
cc @damemi @ingvagabund @uniemimu
Adding those options to NodeFit sounds good to me, but maybe we should split that into a separate follow-up task after converting the current NodeFit to a plugin.
Then at that point it's just a discussion of a change for a single plugin. I don't think that has any architectural impact on the framework refactor.
Makes sense!
For backward compatibility, every v1alpha1 strategy configuration will be converted into a separate v1alpha2 profile internally. In the case of the RemovePodsViolatingNodeAffinity strategy/plugin, the backward compatible setting will correspond to:
- when NodeFit=false (PodFitsAnyNode + !PodFitsCurrentNode) or NodeFit=true (PodFitsAnyOtherNode + PodFitsAnyNode + !PodFitsCurrentNode = PodFitsAnyNode + !PodFitsCurrentNode) (please double check :)):
Looks pretty backwards compatible since NodeFit seems to be working just the same whether it is set to true or false.
The proposed and already merged preEvictionFilter and DefaultEvictor plugin look fine to me, but the key to making this bug solvable is being able to take out the checks at https://github.com/kubernetes-sigs/descheduler/blob/da3ebb7293e67a396a9f41547cd246854903a842/pkg/framework/plugins/removepodsviolatingnodeaffinity/node_affinity.go#L93-L94
A NodeFit plugin sounds interesting.
/lifecycle stale
/remove-lifecycle stale
/lifecycle stale
/lifecycle frozen