[KEP] Introduce MultiKueue Dispatcher API

Open mszadkow opened this issue 6 months ago • 12 comments

What type of PR is this?

/kind feature

What this PR does / why we need it:

The feature aims to improve performance and practicality by reducing the overhead of distributing workloads to all clusters simultaneously, minimizing the risk of duplicate admissions and unnecessary preemptions. It should prevent triggering autoscaling across multiple worker clusters at the same time.

Which issue(s) this PR fixes:

Fixes #5141

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

mszadkow avatar May 29 '25 07:05 mszadkow

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

k8s-ci-robot avatar May 29 '25 07:05 k8s-ci-robot

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
Latest commit ad5249099092c0532aa02f277da86aaa2252f299
Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6878c99d6fbec300085eb8e9

netlify[bot] avatar May 29 '25 07:05 netlify[bot]

/cc @mwielgus @mimowo @tenzen-y

mszadkow avatar May 29 '25 07:05 mszadkow

We also need to discuss the granularity of the timeout, as mentioned by @mimowo:

should the timeout be global, per manager, or per worker?

mszadkow avatar May 29 '25 07:05 mszadkow

In my opinion the question is not whether, but how, we deliver those timeout levels, because we already see at least 2 scenarios that require different levels. One was mentioned in #5141 and the other in #3757.

This one could be more general: a single timeout for a large number of similar clusters.

Both performance (distributing and keeping 40 copies of workload in cluster informers can be expensive) and practical (trying all 40 clusters at the very same time can lead to lots of unnecessary preemptions).

This one should be more granular, probably at the worker level: different clusters, but not many of them.

To prioritize the use of some clusters over others. For example a user may have one cluster with reservations, and one auto-scaled. The user prefers to first try the reservation cluster, and only as a fallback try autoscaling.

mszadkow avatar May 29 '25 08:05 mszadkow

Let's start with a KEP update for this.

/retitle MultiKueue KEP update to introduce MultiKueue Dispatcher API

/release-note-edit

NONE

mimowo avatar May 29 '25 11:05 mimowo

/test pull-kueue-test-e2e-kueueviz-main

mszadkow avatar Jun 20 '25 12:06 mszadkow

@mszadkow the prototype is great, but please separate the implementation into a dedicated PR, so that we can focus on design first.

mimowo avatar Jun 26 '25 09:06 mimowo

> @mszadkow the prototype is great, but please separate the implementation into a dedicated PR, so that we can focus on design first.

Sure will do

mszadkow avatar Jun 26 '25 09:06 mszadkow

Landing here. I will check this within this week.

tenzen-y avatar Jun 26 '25 16:06 tenzen-y

LGTM. I'm not tagging yet, to give @tenzen-y a chance for more comments and to think more about the spec vs status thread.

mimowo avatar Jun 27 '25 19:06 mimowo

/lgtm

/assign @tenzen-y for an extra pair of eyes

mimowo avatar Jul 01 '25 10:07 mimowo

LGTM label has been added.

Git tree hash: 7144aa04b32bba71948a58634c6b96be2f39395e

k8s-ci-robot avatar Jul 01 '25 10:07 k8s-ci-robot

/assign

vladikkuzn avatar Jul 02 '25 23:07 vladikkuzn

@vladikkuzn @mszadkow please address the remaining comment: https://github.com/kubernetes-sigs/kueue/pull/5410#discussion_r2193331651

mimowo avatar Jul 09 '25 06:07 mimowo

/lgtm

/approve

Leaving final approval to @tenzen-y

/hold

mimowo avatar Jul 10 '25 13:07 mimowo

LGTM label has been added.

Git tree hash: ba769afe78d955aebd28fd202bdb634bc6d27a2f

k8s-ci-robot avatar Jul 10 '25 13:07 k8s-ci-robot

The only unresolved thread is https://github.com/kubernetes-sigs/kueue/pull/5410#discussion_r2172575662. Otherwise, we could mark those as resolved.

tenzen-y avatar Jul 17 '25 05:07 tenzen-y

LGTM, leaving final tagging to @tenzen-y

mimowo avatar Jul 17 '25 10:07 mimowo

LGTM label has been added.

Git tree hash: ceda05d5de09e6f812eb804f9e817a99ddac2995

k8s-ci-robot avatar Jul 17 '25 12:07 k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, mszadkow, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • ~~OWNERS~~ [mimowo,tenzen-y]

Approvers can indicate their approval by writing /approve in a comment.

Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Jul 17 '25 12:07 k8s-ci-robot

/hold cancel

tenzen-y avatar Jul 17 '25 12:07 tenzen-y