volcano scheduler: support multi-scheduler deployment via Helm
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adds support for deploying multiple Volcano schedulers using Helm chart with StatefulSet, enabling the multi-scheduler feature without using selectors.
Key Changes:
- Added conditional logic in scheduler.yaml to support both Deployment (single scheduler) and StatefulSet (multi-scheduler) deployment modes
- Introduced a `multi_scheduler_enable` configuration option to toggle between deployment types
Why we need this: A single scheduler cannot satisfy high-throughput requirements in some scenarios. Deploying multiple schedulers improves overall scheduling throughput without requiring manual node labeling or workload modifications.
Which issue(s) this PR fixes:
Fixes #hajnalmt multi-scheduler helm support issue
Special notes for your reviewer:
Testing Instructions:
- Single scheduler mode (default): `helm install volcano installer/helm/chart/volcano/`
- Multi-scheduler mode: `helm install volcano installer/helm/chart/volcano/ --set custom.multi_scheduler_enable=true --set custom.scheduler_replicas=3` (the corresponding values are sketched below)
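For reference, a minimal sketch of the corresponding values.yaml section (field names follow the flags above; the defaults shown are assumptions, not the exact chart contents):
custom:
  multi_scheduler_enable: false   # opt-in switch; false keeps the single-scheduler Deployment
  scheduler_replicas: 1           # replica count used when multi-scheduler mode is enabled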
Backward Compatibility:
- Default behavior remains unchanged (single scheduler via Deployment)
- All existing configurations continue to work without modification
- Multi-scheduler mode is opt-in via explicit configuration
Implementation Details:
- Uses consistent hashing algorithm for automatic job/node assignment
- StatefulSet ensures stable network identities for schedulers
- Headless Service enables direct pod-to-pod communication (see the sketch below)
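To make the last two points concrete, here is an illustrative sketch of the multi-scheduler branch (resource names and labels are assumptions, not the exact chart contents); the headless Service gives each StatefulSet pod a stable DNS record:
{{- if .Values.custom.multi_scheduler_enable }}
apiVersion: v1
kind: Service
metadata:
  name: {{ .Release.Name }}-scheduler-headless   # name chosen for illustration
spec:
  clusterIP: None            # headless: per-pod DNS records instead of a virtual IP
  selector:
    app: volcano-scheduler
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: {{ .Release.Name }}-scheduler
spec:
  serviceName: {{ .Release.Name }}-scheduler-headless   # binds pods to the headless Service
  replicas: {{ .Values.custom.scheduler_replicas }}
  ...
{{- end }}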
Does this PR introduce a user-facing change?
Added support for deploying multiple Volcano schedulers via Helm chart using StatefulSet. This enables high-throughput scheduling scenarios by running multiple scheduler instances with automatic load distribution. Use `--set custom.multi_scheduler_enable=true` to enable multi-scheduler mode.
#4645
Welcome @pinkfloyd-eminem!
It looks like this is your first PR to volcano-sh/volcano.
Thank you, and welcome to Volcano. :smiley:
Summary of Changes
Hello @pinkfloyd-eminem, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the Volcano scheduler's deployment capabilities by introducing support for a multi-scheduler setup via Helm charts. By utilizing Kubernetes StatefulSets, it enables the deployment of multiple scheduler instances to meet high-throughput demands, improving overall scheduling performance without requiring manual workload modifications. The feature is opt-in, maintaining backward compatibility for existing single-scheduler deployments.
Highlights
- Multi-Scheduler Support: Introduces the ability to deploy multiple Volcano schedulers using Helm charts.
- StatefulSet Deployment: Leverages Kubernetes StatefulSets for multi-scheduler deployments, ensuring stable network identities and automatic load distribution.
- Configurable Deployment Type: Adds a `multi_scheduler_enable` option in Helm values to switch between single (Deployment) and multi-scheduler (StatefulSet) modes.
- Performance Improvement: Aims to address high-throughput requirements by improving overall scheduling throughput without manual node labeling or workload modifications.
- Backward Compatibility: Ensures that the default single-scheduler behavior remains unchanged and the multi-scheduler mode is opt-in.
test:
- single scheduler: `helm install volcano installer/helm/chart/volcano/` ok
- multi schedulers: `helm install volcano installer/helm/chart/volcano/ --set custom.multi_scheduler_enable=true --set custom.scheduler_replicas=3` ok
/assign @hzxuzhonghu
Hello @pinkfloyd-eminem, Thank you for implementing my feature request! 😊
Can you edit the PR description for me please? At the part: Which issue(s) this PR fixes: Fixes #hajnalmt multi-scheduler helm support issue
This should be: Which issue(s) this PR fixes: Fixes #4645
So my request is properly linked.
Please additionally:
- I checked the code and SCHEDULER_POD_NUM shall be the same as replicas, otherwise the feature won't work. So we can use `custom.scheduler_replicas` there, no need for an overwrite, I was wrong.
- Gemini is right that there is a lot of duplication; we don't need a separate helper, but it would be good to derive only at the points where it's needed. I mean, for example:
...
{{- if not $multi_scheduler_enabled }}
kind: Deployment
{{- else }}
kind: StatefulSet
{{- end }}
...
- I would rename the value to `multi_scheduler_enabled` from `multi_scheduler_enable` too, to use adjective grammar here and comply with the standards.
- If it's okay, can you start with the Deployment in a `not` branch too, as that will be the default, so it shall come first logically, as in my example above.
- The Service differs only in the `clusterIP: None` part, so it's enough to derive only at that point too, I think (a rough sketch follows below).
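Something like this is what I have in mind for the Service (just a sketch, with the surrounding fields elided):
spec:
  {{- if $multi_scheduler_enabled }}
  clusterIP: None   # only the multi-scheduler (headless) variant needs this line
  {{- end }}
  ...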
Thank you once more for implementing this 👍 Please sign the DCO too.
Hi, @hajnalmt Thank you for your feedback and suggestions! 😊 I appreciate you taking the time to review the code.
I’ve made the following updates based on your comments:
- Reduced code duplication by using conditional derivation only where needed
- Renamed the value from multi_scheduler_enable to multi_scheduler_enabled for better grammar and compliance with standards.
- Adjusted the logical order to prioritize the Deployment branch (default case) as you suggested.
- Simplified the Service template to conditionally set ClusterIP: None only for the multi-scheduler case.
- Verified through testing: I've successfully tested both the single-scheduler and multi-replica Helm deployment scenarios to ensure everything works as expected.
- Signed DCO: all commits now include the required Signed-off-by line.
Thanks again for your guidance!
Thank you @pinkfloyd-eminem for doing the changes! This looks really good. I have one minor request left: since there is no purpose in defining the SCHEDULER_NUM env var for the Deployment, I think we shall move that part inside the if branch too, like:
{{- if $multi_scheduler_enabled }}
- name: MULTI_SCHEDULER_ENABLE
value: "{{ $multi_scheduler_enabled }}"
- name: SCHEDULER_NUM
value: "{{ .Values.custom.scheduler_replicas }}"
- name: SCHEDULER_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
{{- end }}
Thank you for your suggestion. I have adopted it and pushed the change.
Thank you! Perfect.
/cc @Monokaix /cc @JesseStutler
/ok-to-test
Thanks for your contribution. BTW, @pinkfloyd-eminem, did you test that multi-scheduler mode schedules pods properly? I think we could add an e2e test for multi-scheduler mode in the future.
@pinkfloyd-eminem Could you fix the CIs? You should run `make update-development-yaml` to generate the new YAML.
The Ray plugin may have a flaky test; you could submit your PR after you fix the code verify, and trigger CI again to check whether the E2E sequence job can pass.
The Ray plugin may have a flaky test; you could submit your PR after you fix the code verify, and trigger CI again to check whether the E2E sequence job can pass.
Thanks for the reminder, I will check here
Thanks for your contribution. BTW, @pinkfloyd-eminem, did you test that multi-scheduler mode schedules pods properly? I think we could add an e2e test for multi-scheduler mode in the future.
@JesseStutler I tested a simple vcjob in multi-scheduler mode and it was successfully scheduled and executed to completion.
The Ray plugin may have a flaky test; you could submit your PR after you fix the code verify, and trigger CI again to check whether the E2E sequence job can pass.
@JesseStutler I used the `go test -v -ginkgo.focus="Ray Plugin E2E Test"` command to test the Ray plugin and found that the image with DefaultRayImage = "bitnami/ray:2.49.0" could not be pulled. However, it runs successfully after changing it to rayproject/ray:2.49.0. Should I consider changing the default Ray image?
@pinkfloyd-eminem
Yes, you can change the default image (both of them worked; bitnami/ray:2.49.0 was the smallest image in size).
All of the bitnami/ray images have been removed and are not available now (most bitnami images are not available). rayproject/ray:2.49.0 is another option for us.
@JesseStutler If I can upload an image to the volcano repository on Docker Hub, I can build a private test image for the e2e test.
@pinkfloyd-eminem The CI issue shall be handled in a separate PR, and I think Wonki4 handles it in
https://github.com/volcano-sh/volcano/pull/4670
So let's wait until it gets into the repository and retrigger the test with [/]retest afterward.
You can then rebase and update the development.yaml with `make update-development-yaml` after that ☺️
Thank you, I will wait for the changes, rebase, and submit again.
@pinkfloyd-eminem #4670 has already been merged, please rebase onto the latest code and fix the code verify CI, thanks
@pinkfloyd-eminem #4670 has already been merged, please rebase onto the latest code and fix the code verify CI, thanks
Hi, I rebased onto the latest master branch and CI passed.
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: JesseStutler
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [JesseStutler]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/cc @hwdef @lowang-bh @hzxuzhonghu @wangyang0616
What is the necessity of using StatefulSet?
@hwdef good comment
I feel the current multi-scheduler support is very premature. @pinkfloyd-eminem Do you need this feature? From my perspective, we need to improve its reliability by allowing schedulers to die or start dynamically without influencing pod scheduling latency.
What is the necessity of using StatefulSet?
Looks like the current implementation of multi-scheduler depends on the StatefulSet-generated stable pod index: https://github.com/volcano-sh/volcano/blob/74976dee5ac08a38ad0b95a81ef88c573bcb3b9e/pkg/scheduler/cache/util.go#L106. We can see that the SCHEDULER_POD_NAME env is taken from metadata.name, and the hash key must be the same in each process.
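To illustrate (assuming the chart's StatefulSet is named volcano-scheduler), the stable identity looks like this:
# StatefulSet pods get stable, ordered names:
#   volcano-scheduler-0, volcano-scheduler-1, ..., volcano-scheduler-<N-1>
# The trailing ordinal gives each process a stable index to hash against.
env:
  - name: SCHEDULER_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name   # e.g. volcano-scheduler-2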
@hwdef good comment
I feel the current multi-scheduler support is very premature. @pinkfloyd-eminem Do you need this feature? From my perspective, we need to improve its reliability by allowing schedulers to die or start dynamically without influencing pod scheduling latency.
The requirement was raised in #4645. /cc @hajnalmt
I feel the current multi-scheduler support is very premature.
I agree with this. I believe that increasing scheduling throughput by deploying multiple schedulers is a systematic and large-scale engineering effort, not just a matter of scaling replicas. We need to consider factors such as node partitioning, scheduling conflicts, and so on.
Thanks for the input @hwdef and @hzxuzhonghu!
You're absolutely right: this feature is still premature. It lacks proper documentation, doesn't handle pod failures reliably, and currently has no test coverage. Metrics might also be misleading in its current state (if not outright unusable), not to mention how it will interact with other features like networkTopologyAware scheduling and plugins like the nodegroup plugin.
That said, I still believe merging this PR is a good step forward, even if the feature remains experimental for now.
At the moment, I'm using node selectors to guide schedulers to specific nodes, as described here: https://github.com/volcano-sh/volcano/blob/master/docs/design/node-selector.md It's a straightforward approach, but I have concerns. The scheduler cache still contains the entire cluster (as it should), so memory usage scales with each deployment. This setup also increases operational overhead. It's worth noting that this feature isn't documented on the official website either; I only found it by chance in the design docs. So I am searching for and experimenting with alternatives.
We also have other experimental features exposed via the Helm chart: https://github.com/volcano-sh/volcano/blob/master/docs/design/multi-scheduler.md I doubt many users are actively using these, but it's good they exist for those who explore deeper. Interestingly, the feature in this PR is the only one I couldn't enable directly via Helm, which is why I raised this request.
I believe that increasing scheduling throughput by deploying multiple schedulers is a systematic and large-scale engineering effort, not just a matter of scaling replicas
If it can be a matter of scaling replicas, why shouldn’t it be? From the minimal testing I tried, both node partitioning and scheduling conflict handling seem to work reasonably well in the current implementation.
I think we should discuss which features in Volcano are considered experimental and find a way to mark them clearly. What do you think @hzxuzhonghu @JesseStutler @hwdef ?