kueue Distinguish between preemption within ClusterQueue, reclaim within Cohort and preemption with borrowing

What would you like to be added:

Preemption in Kueue can happen due to the following reasons:

Within a Cluster Queue, due to an incoming Pod with higher priority
Within a Cohort, to reclaim quota
Within a Cohort, due to preemption with borrowing, because the preempted workload is below a threshold.

We need to enhance the Condition reason and messages to incorporate this information.

Why is this needed:

For users to better understand why Kueue preempted their Jobs.

Administrators might want to aggregate information for multiple preemptions, so ideally the Reason should be different, with a common prefix. A potential problem is to break users that already depend on the Preempted reason to identify preemptions. Perhaps highlighting the breaking change should be enough in the release notes.

We can also incorporate the information in a metric label.

Completion requirements:

This enhancement requires the following artifacts:

[ ] Design doc
[ ] API change
[ ] Docs update

The artifacts should be linked in subsequent comments.

Mar 20 '24 20:03 alculquicondor

cc @tenzen-y @astefanutti @kerthcet wdyt of the breaking change?

Mar 20 '24 20:03 alculquicondor

+1 not maintaining backward compatibility is OK on our side.

I think that relates to #1741 for the metrics part.

Mar 21 '24 10:03 astefanutti

Actually, the current state could be considered buggy, because the message in the Evicted condition always says: "Preempted to accommodate a higher priority Workload".

So at the very least we can start by distinguishing the message.

Mar 21 '24 16:03 alculquicondor

@alculquicondor, I agree with making UX better. And I'm ok with these breaking changes. I my side, we expose such reasons via our gRPC API. So, I can just modify the internal Server logic.

But, I think we need to announce this change as breaking in RELEASE NOTE and release announcement in google groups.

Mar 22 '24 07:03 tenzen-y

Yes, we can put ACTION REQUIRED in the notes to highlight at the top.

Mar 22 '24 12:03 alculquicondor

+1. The current message is misleading because when reclaimWithinCohort: Any then a workload might get preempted by a lower priority workload actually.

I'm not sure using reason for this purpose is the right way. Currently we have 3 modes of preemption, we are likely to have more in the future, for example, if fair sharing is implemented. So we may end up adding many reasons, and determining the reasons might be tricky at some point as we need to propagate the information via a couple of functions.

I'm thinking about enriching the message, similarly as we build the events in the k8s scheduler. We could have a message like: `Preempted to accommodate the "UID-x" workload submitted to the "team-y" ClusterQueue."

Mar 27 '24 09:03 mimowo

The thing is that messages are not machine-readable. For direct users of Kueue, messages are acceptable. But if you use kueue as a low-level component, then you might need a reason to distinguish between the different kinds of preemption.

The reasons having a common prefix could be good enough to group them as one thing with different variants.

Btw, in fair sharing, the preemption reason would still be PreemptDueToReclaim (or something similar). It's just that the "reclamation thresholds" are dynamic.

Mar 27 '24 14:03 alculquicondor

The thing is that messages are not machine-readable.

I see, so maybe instead we have an extra condition type Preempted, with a dedicated reason. So we have 2 conditions:

- type: Evicted
  reason: Preempted
  message: preempted to accomodate workload UID
- type: Preempted
  reason: ReclaimWithinCohort / PreemptionWithinClusterQueue / FairSharingWithinCohort ... 
  message: preempted to accomodate workload UID # possibly some more data here

For direct users of Kueue, messages are acceptable. But if you use kueue as a low-level component, then you might need a reason to distinguish between the different kinds of preemption.

Yes, but then when a new reason is added by Kueue, then the external system also needs to be extended for monitoring, when a new version of Kueue introduces a new reason (if the system does not need fine grained information). I guess we could mitigate this by some convention, say all reasons start with "PreemptedDueTo".

Mar 27 '24 14:03 mimowo

I like the idea of a dedicated Preempted condition. It's fully backwards compatible. We just need to make sure that the documentation makes a distinction between the two conditions (or that one is a subset of the other).

Mar 27 '24 14:03 alculquicondor

A separate Preempted condition looks like a good solution, bringing better structure maintaining backward compatibility.

Mar 27 '24 15:03 astefanutti

+1 on having the Evicted and the Preempted condition type. Also, we may need to clarify what the difference is between Preemption and Eviction in the documentation.

Mar 28 '24 18:03 tenzen-y

/assign

Apr 03 '24 08:04 mimowo

I'm wondering if we should distinguish reclaimWithinCohort and preemption while borrowing, at the reason level, as suggested in the issue description.

I think preemption while borrowing can happen it two scenarios:

we are only preempting workloads within ClusterQueue here
we are preempting workloads within cohort (enabled by preemption.BorrowWithinCohort), see here.

Then, there are variants of (2), because the threshold does not need to be specified. Also, preemption.BorrowWithinCohort can only be enabled when preemption.reclaimWithinCohort is enabled, so it is more of an additional configuration option, rather than a new "mode" of preempting.

My suggestion would be to just go with two reasons: PreemptionInClusterQueue and ReclaimWithinCohort . Then, we put the preemptor UID into the message so that one can figure out more if has access to the workload (this is is what I do in the initial version of the PR: https://github.com/kubernetes-sigs/kueue/pull/1942).

Any views?

Apr 04 '24 12:04 mimowo

If we can make the distinction between the two, that would be better IMO.

End users might have access to the preemption condition, but they might not have access to the names or priorities or other workloads. But they still would want to know roughly why they were preempted.

A nit on PreemptionInClusterQueue: It shouldn't start with preemption, because that's already the name of the condition. But maybe we can say HigherPriorityWithinClusterQueue. And similarly UnderPriorityThresholdForPreemptionWithBorrowing (I'm open to ideas for a shorter version).

Apr 04 '24 14:04 alculquicondor

From the offline discussions we want to balance two aspects: satisfy the MVP which distinguishes "In ClusterQueue" and "In cohort", and being as granular as possible.

I think there will always be a possibility, that a new config option or threshold is introduced, making the reason to split into two or more detailed reasons.

In that case, I would suggest to start with something simple for now, two reasons: InClusterQueue and InCohort. Then, we say that reasons prefixed by them are meant to provide more detailed information. So, in the future we may have: InClusterQueueByHigherPriorityWorkload or InCohortByBorrowingWorkload, etc.

Apr 04 '24 15:04 mimowo

kueue kueue copied to clipboard

Distinguish between preemption within ClusterQueue, reclaim within Cohort and preemption with borrowing

kueue
kueue copied to clipboard