
NodeSelector may cause descheduler to abort when unnecessary

Open damemi opened this issue 4 years ago • 34 comments

What version of descheduler are you using?

descheduler version: latest

Which descheduler CLI options are you using? --node-selector

What did you do? When using --node-selector to restrict descheduler to just 1 node, this section of code prevents the descheduler from running with:

I1210 17:32:36.835916       1 descheduler.go:107] "The cluster size is 0 or 1 meaning eviction causes service disruption or degradation. So aborting.."

What did you expect to see? Even if the cluster size is actually larger than 1, if we only want to restrict evictions to 1 node (for example, only preventing duplicates from existing on that node), this failure still occurs. This seems to be because the nodeSelector is applied to the node list that is used in this check (see the sketch below).

What did you see instead? I should still be able to run descheduler against a single node out of my cluster, because I may have other nodes available for evicted pods that will not cause degradation
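
For reference, here is a minimal sketch of the kind of guard that produces the log line above. This is not the actual descheduler source (the package and function names are approximations); it only illustrates why a --node-selector matching a single node trips the check even in a larger cluster:

package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// checkClusterSize is a hypothetical stand-in for the check referenced above.
// The nodes slice has already been filtered by --node-selector, so a selector
// matching one node makes the cluster look like a 1-node cluster here.
func checkClusterSize(nodes []*v1.Node) bool {
	if len(nodes) <= 1 {
		klog.V(1).InfoS("The cluster size is 0 or 1 meaning eviction causes service disruption or degradation. So aborting..")
		return false
	}
	return true
}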

damemi avatar Dec 10 '20 17:12 damemi

Yes, the descheduler should work normally if there is more than one node available in the cluster, even if we only selected 1 node. I can help fix this and will submit a PR soon. /assign

lixiang233 avatar Dec 11 '20 07:12 lixiang233

The node selector above is working as expected: it was initially designed to select a smaller set of nodes from all nodes, and that means those nodes are the intended nodes not only for eviction but also for where evicted pods will land. So when the node selector selects only 1 node, that is like having a 1-node cluster, and that's why it does not proceed further.

To give some context, the descheduler was fundamentally designed to balance nodes in an imbalanced cluster. For example, in the case of LowNodeUtilization, it computes resources from all selected nodes, which is most likely all nodes but could be a smaller set selected by the node selector. This node selector functionality was needed for multi-tenant scenarios in OpenShift.

If there is a need to change the behavior of the node selector, i.e., to have different behavior for different strategies, that is fine, but it looks like the above change could break the LowNodeUtilization strategy.

aveshagarwal avatar Dec 11 '20 14:12 aveshagarwal

@aveshagarwal that is a good point, that having a node selector covering only 1 node would hurt the performance of LowNodeUtilization specifically. However, I believe the error that's logged should be updated to reflect that, because running other basic strategies against only 1 node will not inherently cause service disruption if there are other nodes in the cluster.

Perhaps this error could be downgraded to a warning, with LowNodeUtilization disabled and a clearer explanation of why. It's an edge case for sure, but I think it is valid for users who know what they are doing to want to run certain other strategies against only 1 node.

damemi avatar Dec 11 '20 14:12 damemi

My main concern is the understanding and expectations around node selector behavior. It's not just about a 1-node cluster, which is just a special case. Specifically, the comment in the original description, "I should still be able to run Descheduler against a single node out of my cluster, because I may have other nodes available for evicted pods that will not cause degradation", does not paint the right picture.

In general, the descheduler works best when it has a whole view of the cluster it is working on, that is, all nodes. The node selector simply reduces the descheduler's view to a smaller set of nodes, as if the selected nodes were a cluster in themselves. That's why the expectation is that evicted pods must land only on the selected nodes.

When I implemented the node selector feature, the main design expectation was that when a node selector is applied to an x-node cluster and selects n nodes (x >= n), it causes the descheduler to see only those n nodes, and the remaining x-n nodes are completely ignored. In other words, the node selector reduces the descheduler's view to only the nodes selected by the selector.
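
As a rough illustration of that reduced view (a sketch only, not the actual implementation):

package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// reduceView is a sketch: given all x nodes in the cluster, keep only the n
// nodes matching the selector; the remaining x-n nodes are invisible to the
// rest of the descheduling run (strategy calculations, the size check).
func reduceView(allNodes []*v1.Node, selector labels.Selector) []*v1.Node {
	var selected []*v1.Node
	for _, node := range allNodes {
		if selector.Matches(labels.Set(node.Labels)) {
			selected = append(selected, node)
		}
	}
	return selected
}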

The above is what the correct behavior of LowNodeUtilization depends on.

If there is a need to change the node selector behavior for other strategies, that's fine, but it would be better to document the expectations around node selectors and then make required implementation changes.

aveshagarwal avatar Dec 11 '20 22:12 aveshagarwal

I thought the node selector was used to select the nodes that the descheduler can evict pods from, and that it didn't matter whether those evicted pods get rescheduled to selected nodes or not. But after @aveshagarwal explained the initial design, it's clear that pods are only expected to be rescheduled to other selected nodes (in the descheduler's view, the cluster only contains the selected nodes), so the descheduler will abort if we only select 1 node.

BTW, if we want to change this design, these strategies should be disabled: LowNodeUtilization, RemoveDuplicates, RemovePodsViolatingTopologySpreadConstraint.

lixiang233 avatar Dec 14 '20 08:12 lixiang233

I think RemoveDuplicates should be fine, because it doesn't require multi-node calculations to be effective. For example, if I have nodes A, B, and C and only want to keep duplicates off node B, then RemoveDuplicates should still work because it only checks for duplicate pods on that node (that's actually the use case that uncovered this).

I agree with the other 2 though, because both of those require consideration of all nodes to effectively make eviction decisions.
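
To make that per-node argument concrete, here is a simplified sketch in the spirit of RemoveDuplicates (not the actual strategy code, which keys on more than the owner): it looks only at the pods on one node and flags extra pods from the same controller.

package sketch

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// duplicatePodsOnNode is a sketch of a per-node duplicate check: any pod
// whose controller already owns another pod on this node is a duplicate.
func duplicatePodsOnNode(podsOnNode []*v1.Pod) []*v1.Pod {
	seen := map[types.UID]bool{}
	var duplicates []*v1.Pod
	for _, pod := range podsOnNode {
		owner := metav1.GetControllerOf(pod)
		if owner == nil {
			continue // standalone pods are not considered duplicates here
		}
		if seen[owner.UID] {
			duplicates = append(duplicates, pod)
			continue
		}
		seen[owner.UID] = true
	}
	return duplicates
}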

damemi avatar Dec 14 '20 13:12 damemi

The current behavior of the node selector is in line with the intended use case for RemoveDuplicates too. The node selector for RemoveDuplicates was not intended to remove pods from only some specific nodes.

RemoveDuplicates evicts duplicate pods from nodes on the assumption that there are other nodes that can schedule the evicted pods. Like LowNodeUtilization, this strategy also relies on the kube-scheduler's even-spreading functionality.

The initial use case for RemoveDuplicates was that some nodes in a Kubernetes cluster go down, either due to maintenance or other failures, causing some pods to be rescheduled onto other nodes. When the failed or in-maintenance nodes (or their replacements) come back, the descheduler should help spread the duplicate pods evenly again.

So, for example, say there are 4 nodes (either the whole cluster or the nodes selected via the node selector) and an app that has 4 pods, one on each of those nodes. If one of the nodes goes down, the remaining 3 nodes will have one pod each on 2 nodes and 2 pods on one node. When the failed node comes back, the descheduler will evict the duplicate pod and it will get scheduled onto that node (as it is part of the nodes selected via the node selector).

I don't think the descheduler can provide the above behavior with high probability when pods are only selected from some specific nodes. It might work sometimes and not at other times, since the evicted pods can get rescheduled to the same node. It could also lead end users to use this strategy erroneously.

@damemi Could you explain your use case here to help us understand why node selector as it is now is not working as expected?

aveshagarwal avatar Dec 14 '20 22:12 aveshagarwal

The use case came from a user who wanted to test the Descheduler on a single node before expanding it to their entire cluster. My confusion was due to the ambiguity of the error message, where it's not clear whether service degradation is referring to degradation of the cluster, or the descheduler (or both?)

What I am proposing is simply that the check for "cluster size" should not necessarily be coupled to the --node-selector flag, because the selected set is not an accurate representation of the available nodes in the cluster. The descheduler could still consider the actual cluster size for checks like this one while restricting evictions to pods on the selected nodes, similar to namespace filtering.
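
A hedged sketch of that decoupling (the helper name is illustrative, not existing descheduler code): the full cluster view is used for the size check, and the selected subset only for choosing eviction candidates.

package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	coreinformers "k8s.io/client-go/informers/core/v1"
)

// decoupledNodeSets is a sketch: the cluster-size check uses all nodes, while
// eviction is restricted to the nodes matched by --node-selector.
func decoupledNodeSets(nodeInformer coreinformers.NodeInformer, nodeSelector labels.Selector) ([]*v1.Node, error) {
	allNodes, err := nodeInformer.Lister().List(labels.Everything()) // full cluster view
	if err != nil {
		return nil, err
	}
	if len(allNodes) <= 1 {
		// Only abort when the cluster as a whole cannot absorb evicted pods.
		return nil, nil
	}
	// Eviction candidates only; strategies would run against these nodes.
	return nodeInformer.Lister().List(nodeSelector)
}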

There could be use cases for other strategies as well. Say, for security reasons, you want to strictly enforce pod anti-affinity on a specific node, or one node is frequently tainted for some reason. Those 2 strategies could make sense to run against a single node, because even if there really are no other nodes in the cluster for the evicted pods, you only care that they are not running on their current node. The code doesn't take any other nodes into consideration in these cases.

If hypotheticals like that are not within the expected behavior of node selector, then that's fine. What's important is that now this discussion is available to more clearly define the behavior of node selector. This could be an opportunity to update our documentation to explain our decisions around this.

@aveshagarwal thank you for shedding some light on the original design intentions. If you feel there's no change to be made here, then I trust your knowledge as the original author :)

damemi avatar Dec 15 '20 16:12 damemi

should still be able to run descheduler against a single node out of my cluster, because I may have other nodes available for evicted pods that will not cause degradation

With --node-selector set, the descheduler takes into account and evicts pods only from nodes targeted by the node selector. However, an evicted pod can get scheduled to nodes outside of the node selector.

node selector above is working as expected as it was initially designed to select smaller set of nodes from all nodes, and that means only those nodes are the intended nodes not only for eviction but also for where evicted pods will land.

I disagree with the "but also for where evicted pods will land" part. The descheduler does not guarantee where a pod gets rescheduled.

If there is need to change the behavior of node selector, i.e., for having different behavior for different strategies, that is fine, but looks like the above change could break LowNodeUtilization strategy.

In the past we discussed the possibility of moving the --node-selector option under each strategy as a new parameter. The descheduler already provides strategies which can run over any non-empty set of nodes (including a singleton set). The original design goal of load-balancing pods across nodes no longer applies to all the strategies we have in the descheduler today.

Also as @damemi already pointed out

My main concern is understanding and expectation around node selector behavior. It's not just about 1-node cluster which is just a special case. Specifically, the comment in the original description "I should still be able to run Descheduler against a single node out of my cluster, because I may have other nodes available for evicted pods that will not cause degradation" does not provide the right picture.

I second what @aveshagarwal wrote. The description of the issue needs to be more detailed.

BTW, if we want to change this design, these strategies should be disabled: LowNodeUtilization, RemoveDuplicates, RemovePodsViolatingTopologySpreadConstraint.

I am against changing any behaviour around the --node-selector option. Instead, I suggest deprecating the option and having each strategy handle the case. Disabling the mentioned strategies because there's only one node left can be surprising for a user. I'd rather have each strategy run and log a reason why it cannot proceed at the moment.

What I am proposing is simply that the check for "cluster size" not necessarily be coupled to the node-selector flag, because that is not an accurate representation of the available nodes in the cluster. Descheduler could still consider the actual cluster size for its performance while restricting evictions to those on the selected node, similar to namespace filtering.

Moving the check to each strategy is an acceptable solution given the argument of not coupling --node-selector with the check.
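
A sketch of what such a per-strategy guard could look like under this proposal (the guard and its log message are illustrative, not existing descheduler code; imports are omitted and the parameters reuse the types from the strategyFunction signature shown later in this thread):

// Sketch: each strategy validates its own node requirements and logs why it
// skips, instead of the whole descheduler aborting up front.
func LowNodeUtilization(ctx context.Context, client clientset.Interface,
	strategy api.DeschedulerStrategy, nodes []*v1.Node, podEvictor *evictions.PodEvictor) {
	if len(nodes) < 2 {
		klog.V(1).InfoS("LowNodeUtilization requires at least 2 nodes to balance between, skipping", "numberOfNodes", len(nodes))
		return
	}
	// ... strategy logic ...
}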

ingvagabund avatar Jan 04 '21 11:01 ingvagabund

So, based on what was discussed above, we can move the parameter nodeSelector to StrategyParameters and the config file may look like:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
     enabled: true
     params:
        podLifeTime:
          maxPodLifeTimeSeconds: 86400
        nodeSelector: prod=dev

then, each strategy will be able to set and handle this parameter separately.

WDYT @damemi @aveshagarwal @ingvagabund

lixiang233 avatar Jan 20 '21 08:01 lixiang233

Doing that, every strategy will need to have a node informer (one can run nodeInformer.Lister().List(nodeSelector) then).

type strategyFunction func(
	ctx context.Context, client clientset.Interface,
	strategy api.DeschedulerStrategy,
	nodes []*v1.Node,
	podEvictor *evictions.PodEvictor,
)

->

type strategyFunction func(
	ctx context.Context, client clientset.Interface,
	strategy api.DeschedulerStrategy,
	nodeInformer coreinformers.NodeInformer,
	podEvictor *evictions.PodEvictor,
)

Depending on how many nodes there are in a cluster, each strategy will spend more time on listing and filtering nodes.

ingvagabund avatar Jan 20 '21 10:01 ingvagabund

@ingvagabund I'm about to refactor ReadyNodes to return all ready nodes in the cluster, and the descheduler will pass all of them to every strategy; then each strategy will select nodes using its own configured node selector. Nodes will be listed only once, and no strategy will need its own node informer. Here's a code example:

wait.Until(func() {
	nodes, err := nodeutil.ReadyNodes(ctx, rs.Client, nodeInformer)
	...
	for name, f := range strategyFuncs {
		if strategy := deschedulerPolicy.Strategies[api.StrategyName(name)]; strategy.Enabled {
			f(ctx, rs.Client, strategy, nodes, podEvictor)
		}
	}
}
...
func PodLifeTime() {
	selectedNodes, err := nodeutil.SelectNodes(nodes)
	...
}

Selecting nodes in each strategy will cost more time, but if we want to set node selector separately for each strategy, we must bear the consequences.

lixiang233 avatar Jan 20 '21 12:01 lixiang233

I think how @lixiang233 described it is good; then we can just keep the one node informer.

damemi avatar Jan 20 '21 14:01 damemi

@lixiang233 IIUC, what you are suggesting is the same as passing a shared node informer to every strategy and running nodeInformer.Lister().List(nodeSelector) with a subsequent is-node-ready filter.
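
For concreteness, a sketch of that shared-informer path with a readiness filter (an approximation, not an existing descheduler helper):

package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	coreinformers "k8s.io/client-go/informers/core/v1"
)

// readySelectedNodes is a sketch: list nodes matching the strategy's selector
// from the shared informer cache, then keep only nodes reporting Ready.
func readySelectedNodes(nodeInformer coreinformers.NodeInformer, nodeSelector labels.Selector) ([]*v1.Node, error) {
	nodes, err := nodeInformer.Lister().List(nodeSelector)
	if err != nil {
		return nil, err
	}
	ready := make([]*v1.Node, 0, len(nodes))
	for _, node := range nodes {
		for _, cond := range node.Status.Conditions {
			if cond.Type == v1.NodeReady && cond.Status == v1.ConditionTrue {
				ready = append(ready, node)
				break
			}
		}
	}
	return ready, nil
}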

ingvagabund avatar Jan 21 '21 11:01 ingvagabund

@ingvagabund yes, they're basically the same, what I'm suggesting only simplifies the function signature.

Selecting nodes means traversing all nodes, but given that nodes are cached in memory, I think the time it costs is acceptable.

lixiang233 avatar Jan 22 '21 03:01 lixiang233

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar Apr 22 '21 04:04 fejta-bot

/remove-lifecycle stale

seanmalloy avatar Apr 23 '21 05:04 seanmalloy

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar Jul 22 '21 06:07 fejta-bot

/remove-lifecycle stale

ingvagabund avatar Jul 22 '21 10:07 ingvagabund

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 20 '21 10:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Nov 19 '21 11:11 k8s-triage-robot

/remove-lifecycle rotten

damemi avatar Nov 19 '21 13:11 damemi

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 17 '22 13:02 k8s-triage-robot

/remove-lifecycle stale

bwmetcalf avatar Mar 02 '22 03:03 bwmetcalf

We are seeing this issue with the HighNodeUtilization strategy. Our use case is that we use a node selector to select only a subset of our nodes; these are the nodes spun up by karpenter that we want to drain when possible so that karpenter can terminate them. Our other nodes, not selected by the descheduler, are our "base" nodes that are relatively static. It makes sense in this scenario to allow the descheduler to operate against only one node.

bwmetcalf avatar Mar 02 '22 04:03 bwmetcalf

Also, if running as a deployment, this issue causes the descheduler to enter CrashLoopBackOff with the following in the logs. At the very least, why not just emit a message that the descheduler is not going to operate on one node and continue to run?

I0302 13:31:45.482881       1 node.go:46] "Node lister returned empty list, now fetch directly"
I0302 13:31:45.490349       1 descheduler.go:231] "The cluster size is 0 or 1 meaning eviction causes service disruption or degradation. So aborting.."
I0302 13:31:45.490439       1 reflector.go:225] Stopping reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:134
I0302 13:31:45.490480       1 reflector.go:225] Stopping reflector *v1.Namespace (0s) from k8s.io/client-go/informers/factory.go:134
I0302 13:31:45.490531       1 reflector.go:225] Stopping reflector *v1.PriorityClass (0s) from k8s.io/client-go/informers/factory.go:134
panic: close of closed channel

goroutine 77 [running]:
sigs.k8s.io/descheduler/pkg/descheduler.Run.func1()
	/go/src/sigs.k8s.io/descheduler/pkg/descheduler/descheduler.go:76 +0x3c
created by sigs.k8s.io/descheduler/pkg/descheduler.Run
	/go/src/sigs.k8s.io/descheduler/pkg/descheduler/descheduler.go:74 +0x165
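
For background on the panic itself: closing an already-closed channel always panics in Go. A common general-purpose guard is to wrap the close in sync.Once (a generic illustration only, not the descheduler's actual fix):

package main

import "sync"

func main() {
	stopCh := make(chan struct{})
	var once sync.Once
	shutdown := func() {
		// Second and later calls become no-ops instead of panicking
		// with "close of closed channel".
		once.Do(func() { close(stopCh) })
	}
	shutdown()
	shutdown() // safe
}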

bwmetcalf avatar Mar 02 '22 13:03 bwmetcalf

@bwmetcalf could you provide an example of how you're using HighNodeUtilization to drain the single node? (I think I get your use case from the other discussion we had but just want to make sure I understand here)

For the panic, I tried to look at that line of code but it looks like that file has changed. What version of descheduler are you running? I agree that it shouldn't just panic and I don't think it did before, so that's something we should address if so

damemi avatar Mar 02 '22 17:03 damemi

Hey @damemi. Here is a link where I explained our use case in the context of karpenter. Let me know if you need more information. Thanks! https://github.com/aws/karpenter/issues/1433#issuecomment-1057120579

bwmetcalf avatar Mar 02 '22 20:03 bwmetcalf

Just back from vacation and realized I didn't provide an answer to what version we are running. It's 0.23.1, the latest.

bwmetcalf avatar Mar 10 '22 22:03 bwmetcalf

Sorry I've been slow on this. Just checking, are you still seeing that panic with the deployment? The panic looks like the one from https://github.com/kubernetes-sigs/descheduler/issues/728, which should have been fixed in v0.23.1 by https://github.com/kubernetes-sigs/descheduler/pull/700. Also, that issue was related more to Job/CronJob than deployments, so it's weird for it to come up now. Maybe the fix didn't address something correlated with this node-selector behavior.

damemi avatar Mar 24 '22 19:03 damemi