kueue
MultiKueue: Support cluster role sharing (worker and manager inside one cluster)
What would you like to be added:
I would like to support the cluster role sharing functionality for MultiKueue.
Why is this needed:
This allows us to reduce the number of managed clusters, which lowers operational costs for admins.
Indeed, the blocker for this feature request has mostly been resolved by the JobManagedBy feature, and almost all major Job integrations, such as Kubeflow and Ray, support the managedBy functionality as well.
Support for cluster role sharing (worker & manager inside one cluster) is out of scope for this KEP. We will get back to the topic once https://github.com/kubernetes/enhancements/pull/4370 is merged and becomes a wider standard.
https://github.com/kubernetes-sigs/kueue/tree/main/keps/693-multikueue#non-goals
Completion requirements:
This enhancement requires the following artifacts:
- [x] Design doc
- [ ] API change
- [x] Docs update
The artifacts should be linked in subsequent comments.
cc @mimowo @mwielgus @gabesaba @PBundyra
Yeah, it feels that on clusters with 1.32+, with managedBy, it is already supported. We can have one CQ using MK and another CQ not.
Unless I'm missing something, it feels like this is just about adding tests and updating the KEP.
I hope so. Let's check the actual behavior!
/assign
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
I think @IrvingMg and @mszadkow have a lot of experience setting up MultiKueue, so they can easily check that.
Basically, we can have one CQ using MultiKueue and on the same cluster another CQ using the local cluster.
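To make the hybrid setup described above concrete, here is a minimal Go sketch. It is not Kueue code: the `AdmissionCheck` and `ClusterQueue` structs are simplified stand-ins, and the queue and check names are hypothetical. The only assumption about Kueue itself is that a ClusterQueue opts into MultiKueue by referencing an admission check whose controller name is `kueue.x-k8s.io/multikueue`.

```go
package main

import "fmt"

// multiKueueController is the controller name that MultiKueue admission
// checks are assumed to use.
const multiKueueController = "kueue.x-k8s.io/multikueue"

// AdmissionCheck is a simplified, hypothetical stand-in for the Kueue object.
type AdmissionCheck struct {
	Name           string
	ControllerName string
}

// ClusterQueue is a simplified, hypothetical stand-in for the Kueue object.
type ClusterQueue struct {
	Name            string
	AdmissionChecks []string // names of referenced AdmissionCheck objects
}

// usesMultiKueue reports whether cq references at least one admission check
// that is handled by the MultiKueue controller.
func usesMultiKueue(cq ClusterQueue, checks map[string]AdmissionCheck) bool {
	for _, name := range cq.AdmissionChecks {
		if ac, ok := checks[name]; ok && ac.ControllerName == multiKueueController {
			return true
		}
	}
	return false
}

// placement describes where jobs submitted through cq are expected to run.
func placement(cq ClusterQueue, checks map[string]AdmissionCheck) string {
	if usesMultiKueue(cq, checks) {
		return "worker cluster"
	}
	return "local (manager) cluster"
}

func main() {
	// Hypothetical objects living side by side on the manager cluster.
	checks := map[string]AdmissionCheck{
		"multikueue-check": {Name: "multikueue-check", ControllerName: multiKueueController},
	}
	queues := []ClusterQueue{
		{Name: "cq-multikueue", AdmissionChecks: []string{"multikueue-check"}},
		{Name: "cq-local"},
	}
	for _, cq := range queues {
		fmt.Printf("%s -> %s\n", cq.Name, placement(cq, checks))
	}
}
```

In this model the manager cluster keeps its manager role for `cq-multikueue` while also running jobs itself through `cq-local`.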
If @IrvingMg or @mszadkow has enough bandwidth, feel free to take this issue. It would be great if we could add integration or E2E tests to verify the behavior.
/assign
I'm also fine with either integration or e2e tests. I would suggest starting with integration tests, as they are easier to write and debug, and then seeing where we are.
SGTM
/unassign
I was able to run the tests in GKE.
Manager cluster workloads:
job-sample-job-b49sg-50a94   user-queue-single   cluster-queue-single   True   True   78s
job-sample-job-h4ds2-cf483   user-queue          cluster-queue          True   True   82s
jobs:
sample-job-b49sg   Complete   3/3   67s   2m22s
sample-job-h4ds2   Complete   3/3   79s   2m25s
Also, I noticed that admittedWorkloads is 0, even though workloads were admitted (or maybe it's cleared after the job is finished?)
NAME                   COHORT   STRATEGY         PENDING WORKLOADS   ADMITTED WORKLOADS
cluster-queue                   BestEffortFIFO   0                   0
cluster-queue-single            BestEffortFIFO   0                   0
Yes, we only count non-finished admitted workloads: https://github.com/kubernetes-sigs/kueue/blob/main/apis/kueue/v1beta2/clusterqueue_types.go#L541
The field comment there says "Number of admitted workloads that haven't finished yet".
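The counting rule quoted above can be sketched in a few lines of Go. This is an illustration only, not the actual Kueue implementation: the `Workload` struct and `countAdmitted` function are hypothetical, and the two workloads mirror the finished jobs from the GKE run reported earlier.

```go
package main

import "fmt"

// Workload is a simplified, hypothetical stand-in for a Kueue Workload.
type Workload struct {
	Name     string
	Admitted bool
	Finished bool
}

// countAdmitted counts workloads that are admitted and have not finished yet,
// following the rule stated in the field comment.
func countAdmitted(workloads []Workload) int {
	n := 0
	for _, w := range workloads {
		if w.Admitted && !w.Finished {
			n++
		}
	}
	return n
}

func main() {
	// Both workloads were admitted, but both have already finished.
	workloads := []Workload{
		{Name: "job-sample-job-b49sg-50a94", Admitted: true, Finished: true},
		{Name: "job-sample-job-h4ds2-cf483", Admitted: true, Finished: true},
	}
	fmt.Println(countAdmitted(workloads)) // prints 0
}
```

Under this rule, a queue whose jobs have all completed reports 0 admitted workloads, which matches the ADMITTED WORKLOADS column shown above.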
@mszadkow please also update the documentation to say that the hybrid mode is supported. By hybrid (or role sharing) we mean that the manager cluster takes the role of the manager, but also runs some jobs itself through another CQ.
I will follow up with:
- e2e test for manager
- documentation of this feature