kueue
[KEP] Introduce MultiKueue Dispatcher API
What type of PR is this?
/kind feature
What this PR does / why we need it:
This feature aims to improve performance and practicality by reducing the overhead of distributing workloads to all clusters simultaneously and by minimizing the risk of duplicate admissions and unnecessary preemptions. It should also prevent triggering autoscaling across multiple worker clusters at the same time.
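For context only, a minimal sketch of how a dispatcher knob could surface in the Go API; every name below (DispatcherMode, MultiKueueDispatcher, NominationTimeout) is a hypothetical placeholder, not the shape the KEP actually settles on:

```go
// Hypothetical sketch: illustrative names only, not the KEP's final API.
package v1beta1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// DispatcherMode selects how workloads are nominated to worker clusters.
type DispatcherMode string

const (
	// DispatcherModeAllAtOnce replicates the workload to every worker
	// cluster simultaneously (the current behavior).
	DispatcherModeAllAtOnce DispatcherMode = "AllAtOnce"
	// DispatcherModeIncremental nominates worker clusters in waves,
	// extending the candidate set only after a timeout elapses.
	DispatcherModeIncremental DispatcherMode = "Incremental"
)

// MultiKueueDispatcher configures how a workload is spread across
// worker clusters.
type MultiKueueDispatcher struct {
	// Mode selects the dispatching strategy.
	Mode DispatcherMode `json:"mode,omitempty"`
	// NominationTimeout bounds how long one wave of clusters may try
	// to admit the workload before the next wave is added.
	NominationTimeout *metav1.Duration `json:"nominationTimeout,omitempty"`
}
```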
Which issue(s) this PR fixes:
Fixes #5141
Special notes for your reviewer:
Does this PR introduce a user-facing change?
NONE
/cc @mwielgus @mimowo @tenzen-y
We also need to discuss the granularity of the timeout, as mentioned by @mimowo:
should the timeout be global, per manager, or per worker?
In my opinion this is not a question of if, but how, because we already see at least two scenarios that require different timeout levels. One was mentioned in #5141 and the other in #3757.
The first scenario could be more general: a single timeout covering a large set of similar clusters. The motivation is both performance (distributing and keeping 40 copies of a workload in cluster informers can be expensive) and practical (trying all 40 clusters at the very same time can lead to lots of unnecessary preemptions).
The second scenario should be more granular, probably at the worker level: different clusters, but not many of them. The goal is to prioritize the use of some clusters over others. For example, a user may have one cluster with reserved capacity and one that is auto-scaled; the user prefers to first try the reservation cluster, and only as a fallback try the autoscaled one (see the sketch below).
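To make the reservation-then-autoscaling fallback concrete, here is a self-contained Go sketch under assumed semantics: clusters are grouped into tiers, each tier has its own timeout, and the candidate set only grows as tiers expire. The clusterTier type and nominate function are hypothetical illustrations, not Kueue code:

```go
package main

import (
	"fmt"
	"time"
)

// clusterTier groups worker clusters that should be tried together,
// with a fallback timeout before the next tier is added. (Hypothetical.)
type clusterTier struct {
	clusters []string
	timeout  time.Duration
}

// nominate polls the growing candidate set until some cluster admits
// the workload or all tier timeouts expire.
func nominate(tiers []clusterTier, admitted func(cluster string) bool) (string, bool) {
	var candidates []string
	for _, tier := range tiers {
		// Extend the candidate set rather than replacing it: earlier
		// tiers stay eligible while later ones are added.
		candidates = append(candidates, tier.clusters...)
		deadline := time.Now().Add(tier.timeout)
		for time.Now().Before(deadline) {
			for _, c := range candidates {
				if admitted(c) {
					return c, true
				}
			}
			time.Sleep(100 * time.Millisecond) // poll for brevity; real code would watch
		}
	}
	return "", false
}

func main() {
	tiers := []clusterTier{
		// Try the reserved-capacity cluster first (short timeouts for the demo).
		{clusters: []string{"reserved"}, timeout: 2 * time.Second},
		// Fall back to the auto-scaled cluster only afterwards.
		{clusters: []string{"autoscaled"}, timeout: 2 * time.Second},
	}
	// Pretend only the auto-scaled cluster can admit the workload.
	if c, ok := nominate(tiers, func(c string) bool { return c == "autoscaled" }); ok {
		fmt.Println("admitted on", c)
	} else {
		fmt.Println("no cluster admitted the workload")
	}
}
```

Giving each tier its own timeout matches the per-worker granularity discussed above; collapsing all tiers into one corresponds to a single global timeout.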
Let's start with a KEP update for this.
/retitle MultiKueue KEP update to introduce MultiKueue Dispatcher API
/release-note-edit
NONE
/test pull-kueue-test-e2e-kueueviz-main
@mszadkow the prototype is great, but please separate the implementation into a dedicated PR, so that we can focus on design first.
Sure, will do.
Landing here. I will check this within the week.
LGTM. I'm not tagging yet to give @tenzen-y a chance for more comments, and to think more about the spec vs. status thread.
/lgtm
/assign @tenzen-y
for an extra pair of eyes
LGTM label has been added.
/assign
@vladikkuzn @mszadkow please address the remaining comment: https://github.com/kubernetes-sigs/kueue/pull/5410#discussion_r2193331651
/lgtm
/approve
Leaving final approval to @tenzen-y
/hold
LGTM label has been added.
The only unresolved thread is https://github.com/kubernetes-sigs/kueue/pull/5410#discussion_r2172575662. Otherwise, we can mark the rest as resolved.
LGTM, leaving final tagging to @tenzen-y
LGTM label has been added.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: mimowo, mszadkow, tenzen-y
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [mimowo,tenzen-y]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/hold cancel