kueue [KEP] Introduce MultiKueue Dispatcher API

What type of PR is this?

/kind feature

What this PR does / why we need it:

The feature aims to improve performance and practicality by reducing the overhead of distributing workloads to all clusters simultaneously, minimizing the risk of duplicate admissions and unnecessary preemptions. It should prevent triggering autoscaling across multiple worker clusters at the same time.

Which issue(s) this PR fixes:

Fixes #5141

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

May 29 '25 07:05 mszadkow

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

May 29 '25 07:05 k8s-ci-robot

Deploy Preview for kubernetes-sigs-kueue canceled.

Name	Link
Latest commit	ad5249099092c0532aa02f277da86aaa2252f299
Latest deploy log	https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6878c99d6fbec300085eb8e9

May 29 '25 07:05 netlify[bot]

/cc @mwielgus @mimowo @tenzen-y

May 29 '25 07:05 mszadkow

We also need to discuss the granularity of the timeout, as mentioned by @mimowo

should the timeout be global, per manager, or per worker

May 29 '25 07:05 mszadkow

In my opinion, the question is not whether but how we deliver these timeout levels, because we already see at least two scenarios that require different levels. One was mentioned in #5141 and the other in #3757.

This one could be more general: a single timeout for a large number of similar clusters.

Both performance (distributing and keeping 40 copies of workload in cluster informers can be expensive) and practical (trying all 40 clusters at the very same time can lead to lots of unnecessary preemptions).

This one should be more granular, probably at the worker level: the clusters differ, but there are not many of them.

To prioritize the use of some clusters over others. For example a user may have one cluster with reservations, and one auto-scaled. The user prefers to first try the reservation cluster, and only as a fallback try autoscaling.

May 29 '25 08:05 mszadkow

Let's start with a KEP update for this.

/retitle MultiKueue KEP update to introduce MultiKueue Dispatcher API

/release-note-edit

NONE

May 29 '25 11:05 mimowo

/test pull-kueue-test-e2e-kueueviz-main

Jun 20 '25 12:06 mszadkow

@mszadkow the prototype is great, but please separate the implementation into a dedicated PR, so that we can focus on design first.

Jun 26 '25 09:06 mimowo

> @mszadkow the prototype is great, but please separate the implementation into a dedicated PR, so that we can focus on design first.

Sure will do

Jun 26 '25 09:06 mszadkow

Landing here. I will check this within this week.

Jun 26 '25 16:06 tenzen-y

LGTM. I'm not tagging yet, to give @tenzen-y a chance for more comments and to think more about the spec vs status thread.

Jun 27 '25 19:06 mimowo

/lgtm

/assign @tenzen-y for an extra pair of eyes

Jul 01 '25 10:07 mimowo

LGTM label has been added.

Git tree hash: 7144aa04b32bba71948a58634c6b96be2f39395e

Jul 01 '25 10:07 k8s-ci-robot

/assign

Jul 02 '25 23:07 vladikkuzn

@vladikkuzn @mszadkow please address the remaining comment: https://github.com/kubernetes-sigs/kueue/pull/5410#discussion_r2193331651

Jul 09 '25 06:07 mimowo

/lgtm

/approve

Leaving final approval to @tenzen-y

/hold

Jul 10 '25 13:07 mimowo

LGTM label has been added.

Git tree hash: ba769afe78d955aebd28fd202bdb634bc6d27a2f

Jul 10 '25 13:07 k8s-ci-robot

The only unresolved thread is https://github.com/kubernetes-sigs/kueue/pull/5410#discussion_r2172575662. Otherwise, we could mark those as resolved.

Jul 17 '25 05:07 tenzen-y

LGTM, leaving final tagging to @tenzen-y

Jul 17 '25 10:07 mimowo

LGTM label has been added.

Git tree hash: ceda05d5de09e6f812eb804f9e817a99ddac2995

Jul 17 '25 12:07 k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, mszadkow, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [mimowo,tenzen-y]

Approvers can indicate their approval by writing /approve in a comment

Approvers can cancel approval by writing /approve cancel in a comment

Jul 17 '25 12:07 k8s-ci-robot

/hold cancel

Jul 17 '25 12:07 tenzen-y
