
KEP-4328: Affinity Based Eviction

Open · AxeZhan opened this issue 1 year ago · 49 comments

  • One-line PR description: Introducing node affinity RequiredDuringSchedulingRequiredDuringExecution
  • Issue link: https://github.com/kubernetes/enhancements/issues/4328
  • Other comments:

AxeZhan avatar Nov 06 '23 12:11 AxeZhan

/cc

ffromani avatar Nov 15 '23 14:11 ffromani

I came back with an updated user story and alternatives.

For taints: I think we can safely rule out the idea of using taints for this purpose, because taints and tolerations do not offer as many operators as node affinity, so they cannot cover enough scenarios.
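For context, a minimal side-by-side of the operator sets as they exist in core/v1 today (an editorial illustration, not something the KEP proposes): node affinity expressions accept six operators, while a toleration can only match a taint with Equal or Exists.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Node affinity expressions: In, NotIn, Exists, DoesNotExist, Gt, Lt.
	affinityExpr := corev1.NodeSelectorRequirement{
		Key:      "topology.kubernetes.io/zone",
		Operator: corev1.NodeSelectorOpNotIn,
		Values:   []string{"zone-a"},
	}
	// Tolerations: only Equal or Exists against a taint's key/value.
	toleration := corev1.Toleration{
		Key:      "dedicated",
		Operator: corev1.TolerationOpEqual,
		Value:    "gpu",
		Effect:   corev1.TaintEffectNoExecute,
	}
	fmt.Println(affinityExpr.Operator, toleration.Operator)
}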

For the descheduler: well, as a user myself, I hope Kubernetes can natively support this feature. To me, this sounds like a basic capability, like requiredDuringSchedulingIgnoredDuringExecution, and should be part of Kubernetes. Also, I don't want to introduce the descheduler into our existing system just for this feature; it adds additional learning costs for devs and ops.

AxeZhan avatar Dec 06 '23 05:12 AxeZhan

I also wonder if sig-scheduling should be the owner of this feature rather than sig-node. Maybe @kerthcet or @Huang-Wei can weigh in here? Taint eviction is owned by sig-scheduling.

Previously, my tendency was to put this as a manager under the node-lifecycle-controller (https://github.com/kubernetes/kubernetes/pull/121798), which would mean it should be owned by sig-node. However, I have some new thoughts about this now.

On the ownership question: since this controller is very similar to taint-eviction, which is owned by sig-scheduling, and since I think this feature will mostly be used for scheduling purposes, I also think sig-scheduling should be the owner of this feature. Although placing this manager under the node-lifecycle-controller could reduce some repetitive code, using a new controller like taint-eviction makes it easier to improve affinity-related scheduling or to build custom implementations of affinity-based eviction. (For example, if we later implement requiredDuringSchedulingRequiredDuringExecution for podAffinity/podAntiAffinity, we can reuse this new affinity controller.)

For this, maybe @alculquicondor and @Huang-Wei can weigh in here?

AxeZhan avatar Dec 06 '23 05:12 AxeZhan

Since there are different opinions about which SIG should own this KEP, it is difficult to continue the work. I think there are now two possible routes forward:

  1. Let sig-scheduling be the owner: The implementation will be changed to add a new controller, let's call it the AffinityEvictionController.

    • pros: Makes it easier to improve this feature, and easier to implement other affinity-related eviction logic in the future.
    • cons: Adding a new controller will increase the complexity of Kubernetes.
  2. Let sig-node be the owner: The implementation will add a manager under the node-lifecycle-controller just for nodeAffinity requiredDuringSchedulingRequiredDuringExecution (same as now).

    • pros: Placing a manager under the node-lifecycle-controller can reduce code duplication.
    • cons: If we later implement other affinity-related logic (like the same feature for podAffinity), we would still need to add a new controller, and it would become hard to maintain.

I prefer the first route, because I think we can implement requiredDuringSchedulingRequiredDuringExecution not only for nodeAffinity but also for podAffinity and podAntiAffinity in this new controller in future releases.
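To make the first route more concrete, here is a rough editorial sketch of what the core of such an AffinityEvictionController might look like. The helper matchesRequiredDuringExecution and the direct Delete call are placeholders for illustration only; the proposed pod field does not exist yet, and whether deletion should instead go through the eviction API is discussed later in this thread.

package affinityeviction

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

type AffinityEvictionController struct {
	client kubernetes.Interface
}

// handleNodeUpdate would be wired to a Node informer's update handler.
// When a node's labels change, re-check every pod bound to that node and
// delete the ones whose (proposed) requiredDuringSchedulingRequiredDuringExecution
// node affinity no longer matches.
func (c *AffinityEvictionController) handleNodeUpdate(ctx context.Context, node *corev1.Node) error {
	pods, err := c.client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + node.Name,
	})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		if matchesRequiredDuringExecution(pod, node) {
			continue
		}
		// Taint-eviction deletes the pod directly today; whether this controller
		// should respect PDBs instead is an open question in this thread.
		if err := c.client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}

// matchesRequiredDuringExecution is a placeholder for evaluating the proposed
// field against the node's current labels.
func matchesRequiredDuringExecution(pod *corev1.Pod, node *corev1.Node) bool {
	// ...evaluate pod.Spec.Affinity.NodeAffinity (new field) against node.Labels...
	return true
}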

For this, I need opinions from sig-scheduling-leads. /cc @alculquicondor @Huang-Wei @ahg-g

AxeZhan avatar Dec 07 '23 16:12 AxeZhan

@ffromani @kannon92 FYI, I added you as this KEP's reviewers. Thanks for reviewing ❤️.

AxeZhan avatar Dec 13 '23 15:12 AxeZhan

/cc @mrunalp @derekwaynecarr Can you take a look? Or assign a reviewer from sig-node.

AxeZhan avatar Dec 20 '23 13:12 AxeZhan

/cc @mrunalp @derekwaynecarr Can you take a look? Or assign a reviewer from sig-node.

Both @ffromani and I are on the reviewer list and we focus on sig-node mostly.

kannon92 avatar Dec 20 '23 19:12 kannon92

Both @ffromani and I are on the reviewer list and we focus on sig-node mostly.

That would be great! I thought it had to be someone from the kubernetes/sig-node-feature-requests team before 😂.

AxeZhan avatar Dec 21 '23 02:12 AxeZhan

@alculquicondor What can I do now to further drive the progress of this enhancement?

AxeZhan avatar Jan 16 '24 02:01 AxeZhan

/cc @dchen1107 Just a kind reminder, as the deadline is approaching.

AxeZhan avatar Jan 30 '24 01:01 AxeZhan

https://github.com/kubernetes/enhancements/pull/4329#discussion_r1455904102 Need a review from a sig-node approver :) /cc @dchen1107 @SergeyKanzhelev @mrunalp

AxeZhan avatar Feb 01 '24 15:02 AxeZhan

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: AxeZhan, mrunalp Once this PR has been reviewed and has the lgtm label, please ask for approval from wojtek-t and additionally assign huang-wei for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Feb 02 '24 10:02 k8s-ci-robot

How do we handle failures when deleting pods? Will we re-trigger the actions? Maybe the same pattern as node taint eviction.

We'll retry the delete action a few more times, and if we still fail to delete the pods, we'll leave them as they are. The same pattern as taint-eviction.

Removing plenty of pods from nodes is risky; maybe we should add toleration seconds (a more fine-grained approach would be to combine the nodeLabel update events for the same node).

We'll respect terminationGracePeriodSeconds during eviction, and this behavior may help? Like the comment about PDB, this concern is about risk, and we tend to let users bear it(?) https://github.com/kubernetes/enhancements/pull/4329#discussion_r1425434185 Users should choose this node affinity only when they're sure the service will not function on such a node, or that deletion will not have unacceptable effects.

In a large cluster, will we consider parallelization? Like a configurable worker number for acceleration.

Yes, we can implement some worker logic for parallelization, but I prefer not to expose a configurable setting to users, at least in the alpha stage.
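A minimal sketch of how the retry and worker answers above could be wired together with client-go's workqueue; the worker count, retry limit, and processNode are illustrative assumptions, not part of the KEP.

package affinityeviction

import "k8s.io/client-go/util/workqueue"

const (
	numWorkers = 4 // would likely stay a fixed default in alpha rather than a user-facing knob
	maxRetries = 5 // mirror the "retry a few times, then leave the pod as it is" answer above
)

type controller struct {
	queue workqueue.RateLimitingInterface // keyed by node name
}

// Run starts a fixed pool of workers draining the same queue.
func (c *controller) Run(stopCh <-chan struct{}) {
	for i := 0; i < numWorkers; i++ {
		go c.worker()
	}
	<-stopCh
	c.queue.ShutDown()
}

func (c *controller) worker() {
	for {
		key, shutdown := c.queue.Get()
		if shutdown {
			return
		}
		err := c.processNode(key.(string)) // re-evaluate pods bound to this node
		switch {
		case err == nil:
			c.queue.Forget(key)
		case c.queue.NumRequeues(key) < maxRetries:
			c.queue.AddRateLimited(key) // retry failed deletions with backoff
		default:
			c.queue.Forget(key) // give up, same pattern as taint-eviction
		}
		c.queue.Done(key)
	}
}

func (c *controller) processNode(name string) error {
	// ...list pods on the node and delete the ones violating the new affinity field...
	return nil
}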

AxeZhan avatar Feb 05 '24 14:02 AxeZhan

ping @alculquicondor

AxeZhan avatar Feb 05 '24 14:02 AxeZhan

We'll respect terminationGracePeriodSeconds during eviction, and this behavior may help? Like the comment about PDB, this concern is about risk, and we tend to let users bear it(?)

What I'm referring to is something similar to taint tolerationSeconds: what if someone wrongly labelled the node and corrected the label ASAP? With the current implementation, we'd remove the corresponding pods immediately, with no toleration period.

kerthcet avatar Feb 06 '24 03:02 kerthcet

What I'm referring to is something similar to taint tolerationSeconds: what if someone wrongly labelled the node and corrected the label ASAP? With the current implementation, we'd remove the corresponding pods immediately, with no toleration period.

I hadn't thought about misoperation before. This is a good point; I think we can add this. So, two options:

  1. Add it as part of NodeSelector. This is a little weird, as NodeSelector is used in many places but this field would only be used by this controller.
type NodeSelector struct {
	NodeSelectorTerms []NodeSelectorTerm
	// new
	TolerationSeconds *int64
}
  2. Create a new API type.
type RequiredNodeSelector struct {
	NodeSelectorTerms []NodeSelectorTerm
	TolerationSeconds *int64
}

I personally prefer option 2, what do you think? @kerthcet @alculquicondor
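A small sketch of how a TolerationSeconds field from either option above might be enforced, loosely mirroring how NoExecute taints honour tolerationSeconds today; the type and method names are placeholders for illustration.

package affinityeviction

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

type evictionScheduler struct {
	queue workqueue.DelayingInterface // keyed by "namespace/name" of the pod to evict
}

// scheduleEviction is called when a pod's required-during-execution node
// affinity stops matching its node's labels.
func (s *evictionScheduler) scheduleEviction(podKey string, tolerationSeconds *int64) {
	if tolerationSeconds == nil {
		s.queue.Add(podKey) // no grace window: evict on the next sync
		return
	}
	// Give the user time to revert an accidental label change; when the item
	// is popped, the worker re-checks the affinity before deleting, so a
	// corrected label cancels the eviction.
	s.queue.AddAfter(podKey, time.Duration(*tolerationSeconds)*time.Second)
}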

AxeZhan avatar Feb 06 '24 08:02 AxeZhan

I personally prefer option 2, what do you think? @kerthcet @alculquicondor

@alculquicondor WDYT of adding a toleration config?

AxeZhan avatar Feb 06 '24 15:02 AxeZhan

@alculquicondor Could you take a look again😢? It's too close to the freeze.

AxeZhan avatar Feb 06 '24 20:02 AxeZhan

Sorry, I have too many KEPs to review. I'll give another pass in a bit.

alculquicondor avatar Feb 06 '24 20:02 alculquicondor

To me, option 2, but RequiredSchedulingTerm seems more aligned with the existing PreferredSchedulingTerm.

type RequiredSchedulingTerm struct {
	Required NodeSelectorTerm
	TolerationSeconds *int64
}

kerthcet avatar Feb 07 '24 03:02 kerthcet

To me, option 2, but RequiredSchedulingTerm seems more aligned with the existing PreferredSchedulingTerm.

Right, but since RequiredDuringSchedulingIgnoredDuringExecution doesn't need a TolerationSeconds, maybe RequiredExecutionTerm is more suitable? (Just the naming.)

type RequiredExecutionTerm struct {
	Required NodeSelectorTerm
	TolerationSeconds *int64
}

AxeZhan avatar Feb 07 '24 03:02 AxeZhan

From PRR perspective this looks good for Alpha - please ping me when you have SIG level approval and I will approve.

wojtek-t avatar Feb 07 '24 08:02 wojtek-t

@alculquicondor I added tolerationSeconds to the KEP, PTAL.

Also, could we get this merged before the freeze for 1.30 and have the discussion about PDB for both taint-eviction and node-affinity-eviction in a follow-up? I don't think there's a way to refactor taint-eviction to make it respect PDB at a rather small engineering cost. I think that refactor itself deserves a KEP. And if we were going to refactor taint-eviction, it would be easy to refactor node-affinity-eviction afterwards, as the two controllers will share a lot of common logic and have similar code.

AxeZhan avatar Feb 07 '24 09:02 AxeZhan

Also, could we get this merged before the freeze for 1.30 and have the discussion about PDB for both taint-eviction and node-affinity-eviction in a follow-up?

Not really. It's a fundamental difference in the design.

alculquicondor avatar Feb 07 '24 16:02 alculquicondor

I don't think there's a way to refactor taint-eviction to make it respect PDB at a rather small engineering cost.

This KEP shouldn't have to change the way taint-eviction works.

alculquicondor avatar Feb 07 '24 16:02 alculquicondor

This KEP shouldn't have to change the way taint-eviction works.

I think taint-eviction and this KEP should follow the same pattern, and if we decide to respect PDB in this KEP, we should also do it for taint-eviction, right? So yes, we don't have to change taint-eviction in this KEP, but we should change it if we decide to respect PDB.

And I just can't think of a good design to achieve this (respecting PDB): https://github.com/kubernetes/enhancements/pull/4329#discussion_r1480557774.
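For reference, a minimal sketch of the two behaviours being debated, assuming client-go: taint-eviction style direct deletion does not consult PDBs, while going through the eviction subresource makes the API server enforce them and reject the request when a budget would be violated, which is why the controller would then need a retry strategy for pods it cannot evict yet.

package affinityeviction

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteDirectly is the taint-eviction style: PDBs are not consulted.
func deleteDirectly(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	return client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{})
}

// evictViaAPI uses the eviction subresource instead; the call fails if a
// PodDisruptionBudget would be violated.
func evictViaAPI(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	return client.PolicyV1().Evictions(namespace).Evict(ctx, &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
	})
}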

AxeZhan avatar Feb 07 '24 16:02 AxeZhan

I think taint-eviction and this KEP should follow the same pattern

Not necessarily. It all depends on who this API is for, or when it's used.

In fact, the argument could be made that if you want immediate preemptions, you should put a NoExecute taint on the nodes. But if you want PDB support, you can use RequiredDuringExecution node affinities.

I'd really like the opinion from sig-apps. @atiratree maybe you can give a pass?

alculquicondor avatar Feb 07 '24 17:02 alculquicondor

I will try to visit it today.

atiratree avatar Feb 08 '24 14:02 atiratree

@atiratree FYI https://github.com/kubernetes/enhancements/pull/4329#discussion_r1480557774, the main discussion here is how we should handle PDB in this controller, and why taint-eviction doesn't respect PDB today.

AxeZhan avatar Feb 08 '24 14:02 AxeZhan

I also wonder if there is a way to reuse an existing controller or use descheduler for this.

Yeah, I am thinking about https://github.com/kubernetes-sigs/descheduler?tab=readme-ov-file#removepodsviolatingnodeaffinity here.

We get quite a few questions and requests about getting some of the descheduler's functionality into core Kubernetes, and I wonder why we would make an exception for this now. That new controller with the new affinity type would essentially be the descheduler with the RemovePodsViolatingNodeAffinity plugin enabled, except that with the descheduler, evictions would apply to all pods violating the affinities, whereas for this proposal they would apply only to affinities of the new type (am I correct?)

I concur with @kannon92: existing alternatives and why they are not sufficient to cover this use case should be discussed in more detail. Last time I checked, the descheduler seemed a very nice fit for this use case.

I second that.

For the descheduler: well, as a user myself, I hope Kubernetes can natively support this feature. To me, this sounds like a basic capability, like requiredDuringSchedulingIgnoredDuringExecution, and should be part of Kubernetes. Also, I don't want to introduce the descheduler into our existing system just for this feature; it adds additional learning costs for devs and ops.

I understand that, but I feel like this is not enough justification, at least looking at the history of why the descheduler was built as a project outside the core, and how we have been telling users with similar requests exactly that so far. So my question is: what makes this different?

A few concerns with that:

  • We would have to maintain these two things upstream.
  • New features or corner cases handled in one would get noticed by users and requested in the other, and I guess users would expect parity (and sometimes there will be enough justification to implement the thing in the other).
  • It would be confusing for users to have these similar things handled by a core controller and by a separate SIG project...

What stops us from pushing a lot of other descheduler features into core controllers if we accept this one in? What is the difference?

bringing in @damemi @ingvagabund @a7i

knelasevero avatar Feb 08 '24 17:02 knelasevero