MultiKueue: Support cluster role sharing (worker and manager inside one cluster)

Open tenzen-y opened this issue 5 months ago • 3 comments

What would you like to be added:

I would like to support the cluster role sharing functionality for MultiKueue.

Why is this needed:

This allows us to reduce the number of managed clusters, which lowers the operational costs for admins.

Indeed, the blocker for this feature request was mostly resolved by the JobManagedBy feature, and almost all major Job integrations, like Kubeflow and Ray, support the managedBy functionality as well.

Support for cluster role sharing (worker & manager inside one cluster) is out of scope for this KEP. We will get back to the topic once https://github.com/kubernetes/enhancements/pull/4370 is merged and becomes a wider standard.

https://github.com/kubernetes-sigs/kueue/tree/main/keps/693-multikueue#non-goals

Completion requirements:

This enhancement requires the following artifacts:

  • [x] Design doc
  • [ ] API change
  • [x] Docs update

The artifacts should be linked in subsequent comments.

tenzen-y avatar Jun 20 '25 16:06 tenzen-y

cc @mimowo @mwielgus @gabesaba @PBundyra

tenzen-y avatar Jun 20 '25 16:06 tenzen-y

Indeed, the blocker for this feature request was mostly resolved by the JobManagedBy feature, and almost all major Job integrations, like Kubeflow and Ray, support the managedBy functionality as well.

Yeah, it feels that on clusters with 1.32+, with managedBy, it is already supported. We can have one CQ using MK and another CQ not.

Unless I'm missing something it feels like just about adding tests and updating the KEP.

mimowo avatar Jun 20 '25 16:06 mimowo

Yeah, it feels that on clusters with 1.32+, with managedBy, it is already supported. We can have one CQ using MK and another CQ not.

Unless I'm missing something it feels like just about adding tests and updating the KEP.

I hope so. Let's check the actual behavior!

tenzen-y avatar Jun 20 '25 16:06 tenzen-y

/assign

tenzen-y avatar Jul 29 '25 18:07 tenzen-y

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 27 '25 19:10 k8s-triage-robot

/remove-lifecycle stale

mimowo avatar Oct 27 '25 19:10 mimowo

I think @IrvingMg and @mszadkow have a lot of experience setting up MultiKueue, so they can easily check that.

Basically, we can have one CQ using MultiKueue and on the same cluster another CQ using the local cluster.

mimowo avatar Nov 14 '25 18:11 mimowo

I think @IrvingMg and @mszadkow have a lot of experience setting up MultiKueue, so they can easily check that.

Basically, we can have one CQ using MultiKueue and on the same cluster another CQ using the local cluster.

If @IrvingMg or @mszadkow has enough bandwidth, feel free to take this issue. It would be great if we could add integration or E2E tests to verify the behavior.

tenzen-y avatar Nov 14 '25 21:11 tenzen-y

/assign

mszadkow avatar Nov 18 '25 08:11 mszadkow

I'm also fine with either integration or e2e tests. I would suggest starting with integration tests, as they are easier to write and debug, and then seeing where we are.

mimowo avatar Nov 18 '25 08:11 mimowo

I'm also fine with either integration or e2e tests. I would suggest starting with integration tests, as they are easier to write and debug, and then seeing where we are.

SGTM

/unassign

tenzen-y avatar Nov 18 '25 15:11 tenzen-y

I was able to run the tests in GKE.

Manager cluster workloads:

  job-sample-job-b49sg-50a94   user-queue-single   cluster-queue-single   True   True   78s
  job-sample-job-h4ds2-cf483   user-queue          cluster-queue          True   True   82s

Jobs:

  sample-job-b49sg   Complete   3/3   67s   2m22s
  sample-job-h4ds2   Complete   3/3   79s   2m25s

Also, I noticed that admittedWorkloads is 0 for both ClusterQueues, even though workloads were admitted (or maybe it's cleared after the job is finished?):

  NAME                   COHORT   STRATEGY         PENDING WORKLOADS   ADMITTED WORKLOADS
  cluster-queue                   BestEffortFIFO   0                   0
  cluster-queue-single            BestEffortFIFO   0                   0

mszadkow avatar Nov 20 '25 14:11 mszadkow

or maybe it's cleared after the job is finished?)

Yes, we only count non-finished admitted workloads: https://github.com/kubernetes-sigs/kueue/blob/main/apis/kueue/v1beta2/clusterqueue_types.go#L541

The field comment there says: "Number of admitted workloads that haven't finished yet".
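To illustrate the counting rule, here is a minimal sketch in Python. It is not Kueue's implementation: the `Workload` class and the `admitted_workloads` helper are hypothetical, and reading the two `True` columns of the workload listing above as "admitted" and "finished" is an assumption, since the listing has no header row.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # Hypothetical, simplified stand-in for a Kueue Workload.
    name: str
    cluster_queue: str
    admitted: bool
    finished: bool


def admitted_workloads(workloads, cluster_queue):
    """Count workloads in a ClusterQueue that are admitted and not yet finished."""
    return sum(
        1
        for w in workloads
        if w.cluster_queue == cluster_queue and w.admitted and not w.finished
    )


# The two workloads from the listing above, both admitted and finished.
workloads = [
    Workload("job-sample-job-b49sg-50a94", "cluster-queue-single", True, True),
    Workload("job-sample-job-h4ds2-cf483", "cluster-queue", True, True),
]

for cq in ("cluster-queue", "cluster-queue-single"):
    print(cq, admitted_workloads(workloads, cq))  # both print 0
```

Under this rule a finished job no longer contributes to the count, which would explain the zeros once both sample jobs reached Complete.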

mimowo avatar Nov 20 '25 14:11 mimowo

@mszadkow, please also update the documentation to say that the hybrid mode is supported. By hybrid (or role sharing) we mean the manager cluster taking the role of the manager, but also running some jobs in another CQ.
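As a rough illustration of what such a hybrid setup might look like on the manager cluster, the sketch below builds two ClusterQueue manifests as plain Python dicts and prints them as JSON, which `kubectl apply -f` also accepts. Everything specific here is an assumption: the `v1beta1` API version (newer versions may use different fields), the `sample-multikueue` admission check name, the flavor name and quotas, and the guess that `cluster-queue` is the MultiKueue queue while `cluster-queue-single` is the local one. It also assumes an AdmissionCheck with controller name `kueue.x-k8s.io/multikueue`, plus its MultiKueueConfig and MultiKueueCluster, already exist.

```python
import json

# Assumption: the v1beta1 Kueue API; field names may differ in newer versions.
API_VERSION = "kueue.x-k8s.io/v1beta1"


def cluster_queue(name, admission_checks=None):
    """Build a ClusterQueue manifest; flavor name and quotas are placeholders."""
    spec = {
        "namespaceSelector": {},  # empty selector matches all namespaces
        "resourceGroups": [{
            "coveredResources": ["cpu", "memory"],
            "flavors": [{
                "name": "default-flavor",
                "resources": [
                    {"name": "cpu", "nominalQuota": 9},
                    {"name": "memory", "nominalQuota": "36Gi"},
                ],
            }],
        }],
    }
    if admission_checks:
        # Only the MultiKueue queue references the MultiKueue admission check.
        spec["admissionChecks"] = list(admission_checks)
    return {
        "apiVersion": API_VERSION,
        "kind": "ClusterQueue",
        "metadata": {"name": name},
        "spec": spec,
    }


# Workloads admitted here are expected to be dispatched to worker clusters.
remote_cq = cluster_queue("cluster-queue", ["sample-multikueue"])
# Workloads admitted here are expected to run on the manager cluster itself.
local_cq = cluster_queue("cluster-queue-single")

manifest_list = {"apiVersion": "v1", "kind": "List", "items": [remote_cq, local_cq]}
print(json.dumps(manifest_list, indent=2))
```

The only difference between the two queues in this sketch is the `admissionChecks` entry; each would still need its own LocalQueue (for example `user-queue` and `user-queue-single`) pointing at it.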

mimowo avatar Nov 21 '25 08:11 mimowo

I will follow up with:

  • e2e test for manager
  • documentation of this feature

mszadkow avatar Nov 21 '25 09:11 mszadkow