volcano scheduler: support multi-scheduler deployment via Helm
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adds support for deploying multiple Volcano schedulers using Helm chart with StatefulSet, enabling the multi-scheduler feature without using selectors.
Key Changes:
- Added conditional logic in scheduler.yaml to support both Deployment (single scheduler) and StatefulSet (multi-scheduler) deployment modes
- Introduced a `multi_scheduler_enable` configuration option to toggle between deployment types
Why we need this: A single scheduler cannot satisfy high-throughput requirements in some scenarios. Deploying multiple schedulers improves overall scheduling throughput without requiring manual node labeling or workload modifications.
Which issue(s) this PR fixes:
Fixes #hajnalmt multi-scheduler helm support issue
Special notes for your reviewer:
Testing Instructions:
- Single scheduler mode (default): `helm install volcano installer/helm/chart/volcano/`
- Multi-scheduler mode: `helm install volcano installer/helm/chart/volcano/ --set custom.multi_scheduler_enable=true --set custom.scheduler_replicas=3` (the corresponding values are sketched below)
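For reference, a minimal sketch of the corresponding values.yaml section (field names follow the flags above; the defaults shown are assumptions, not the exact chart contents):
custom:
  multi_scheduler_enable: false   # opt-in switch; false keeps the single-scheduler Deployment
  scheduler_replicas: 1           # replica count used when multi-scheduler mode is enabled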
Backward Compatibility:
- Default behavior remains unchanged (single scheduler via Deployment)
- All existing configurations continue to work without modification
- Multi-scheduler mode is opt-in via explicit configuration
Implementation Details:
- Uses consistent hashing algorithm for automatic job/node assignment
- StatefulSet ensures stable network identities for schedulers
- Headless Service enables direct pod-to-pod communication (see the sketch below)
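To make the last two points concrete, here is an illustrative sketch of the multi-scheduler branch (resource names and labels are assumptions, not the exact chart contents); the headless Service gives each StatefulSet pod a stable DNS record:
{{- if .Values.custom.multi_scheduler_enable }}
apiVersion: v1
kind: Service
metadata:
  name: {{ .Release.Name }}-scheduler-headless   # name chosen for illustration
spec:
  clusterIP: None            # headless: per-pod DNS records instead of a virtual IP
  selector:
    app: volcano-scheduler
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: {{ .Release.Name }}-scheduler
spec:
  serviceName: {{ .Release.Name }}-scheduler-headless   # binds pods to the headless Service
  replicas: {{ .Values.custom.scheduler_replicas }}
  ...
{{- end }}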
Does this PR introduce a user-facing change?
Added support for deploying multiple Volcano schedulers via Helm chart using StatefulSet. This enables high-throughput scheduling scenarios by running multiple scheduler instances with automatic load distribution. Use `--set custom.multi_scheduler_enable=true` to enable multi-scheduler mode.
#4645
Welcome @pinkfloyd-eminem!
It looks like this is your first PR to volcano-sh/volcano.
Thank you, and welcome to Volcano. :smiley:
Summary of Changes
Hello @pinkfloyd-eminem, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the Volcano scheduler's deployment capabilities by introducing support for a multi-scheduler setup via Helm charts. By utilizing Kubernetes StatefulSets, it enables the deployment of multiple scheduler instances to meet high-throughput demands, improving overall scheduling performance without requiring manual workload modifications. The feature is opt-in, maintaining backward compatibility for existing single-scheduler deployments.
Highlights
- Multi-Scheduler Support: Introduces the ability to deploy multiple Volcano schedulers using Helm charts.
- StatefulSet Deployment: Leverages Kubernetes StatefulSets for multi-scheduler deployments, ensuring stable network identities and automatic load distribution.
- Configurable Deployment Type: Adds a `multi_scheduler_enable` option in Helm values to switch between single (Deployment) and multi-scheduler (StatefulSet) modes.
- Performance Improvement: Aims to address high-throughput requirements by improving overall scheduling throughput without manual node labeling or workload modifications.
- Backward Compatibility: Ensures that the default single-scheduler behavior remains unchanged and the multi-scheduler mode is opt-in.
test:
- single scheduler: `helm install volcano installer/helm/chart/volcano/` ok
- multi schedulers: `helm install volcano installer/helm/chart/volcano/ --set custom.multi_scheduler_enable=true --set custom.scheduler_replicas=3` ok
/assign @hzxuzhonghu
Hello @pinkfloyd-eminem, Thank you for implementing my feature request! 😊
Can you edit the PR description for me please? At the part: Which issue(s) this PR fixes: Fixes #hajnalmt multi-scheduler helm support issue
This should be: Which issue(s) this PR fixes: Fixes #4645
So my request is properly linked.
Please additionally:
- I checked the code and SCHEDULER_POD_NUM shall be the same as replicas, otherwise the feature won't work. So we can use `custom.scheduler_replicas` there, no need for an overwrite, I was wrong.
- Gemini is right that there is a lot of duplication; we don't need a separate helper, but it would be good to derive only at the points where it's needed. I mean, for example:
...
{{- if not $multi_scheduler_enabled }}
kind: Deployment
{{- else }}
kind: StatefulSet
{{- end }}
...
- I would rename the value to `multi_scheduler_enabled` from `multi_scheduler_enable` too, to use adjective grammar here and comply with the standards.
- If it's okay, can you start with the Deployment in a `not` branch too, as that will be the default, so it shall come first logically, as in my example above.
- The Service differs only in the `clusterIP: None` part, so it's enough to derive only at that point too, I think (a rough sketch follows below).
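Something like this is what I have in mind for the Service (just a sketch, with the surrounding fields elided):
spec:
  {{- if $multi_scheduler_enabled }}
  clusterIP: None   # only the multi-scheduler (headless) variant needs this line
  {{- end }}
  ...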
Thank you once more for implementing this 👍 Please sign the DCO too.
Hi, @hajnalmt Thank you for your feedback and suggestions! 😊 I appreciate you taking the time to review the code.
I’ve made the following updates based on your comments:
- Reduced code duplication by using conditional derivation only where needed
- Renamed the value from multi_scheduler_enable to multi_scheduler_enabled for better grammar and compliance with standards.
- Adjusted the logical order to prioritize the Deployment branch (default case) as you suggested.
- Simplified the Service template to conditionally set ClusterIP: None only for the multi-scheduler case.
- Verified through testing: I've successfully tested both the single-scheduler and multi-replica Helm deployment scenarios to ensure everything works as expected.
- Signed DCO: all commits now include the required Signed-off-by line.
Thanks again for your guidance!
Thank you @pinkfloyd-eminem for doing the changes! This looks really good. I have one minor request left: since there is no purpose in defining the SCHEDULER_NUM env var for the Deployment, I think we shall move that part inside the if branch too, like:
{{- if $multi_scheduler_enabled }}
- name: MULTI_SCHEDULER_ENABLE
value: "{{ $multi_scheduler_enabled }}"
- name: SCHEDULER_NUM
value: "{{ .Values.custom.scheduler_replicas }}"
- name: SCHEDULER_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
{{- end }}
Thank you for your suggestion. I have adopted it and pushed the change.
Thank you! Perfect.
/cc @Monokaix /cc @JesseStutler
/ok-to-test
Thanks for your contribution. BTW, @pinkfloyd-eminem, did you test that multi-scheduler mode schedules pods properly? I think we could add an e2e test for multi-scheduler mode in the future.
@pinkfloyd-eminem Could you fix the CIs? You should run `make update-development-yaml` to generate the new YAML.
The Ray plugin may have a flaky test; you could submit your PR after you fix the code verify, and trigger CI again to check whether the E2E sequence job can pass.
The Ray plugin may have a flaky test; you could submit your PR after you fix the code verify, and trigger CI again to check whether the E2E sequence job can pass.
Thanks for the reminder, I will check here
Thanks for your contribution. BTW, @pinkfloyd-eminem, did you test that multi-scheduler mode schedules pods properly? I think we could add an e2e test for multi-scheduler mode in the future.
@JesseStutler I tested a simple vcjob in multi-scheduler mode and it was successfully scheduled and executed to completion.
The Ray plugin may have a flaky test; you could submit your PR after you fix the code verify, and trigger CI again to check whether the E2E sequence job can pass.
@JesseStutler I used the `go test -v -ginkgo.focus="Ray Plugin E2E Test"` command to test the Ray plugin and found that the image with DefaultRayImage = "bitnami/ray:2.49.0" could not be pulled. However, it runs successfully after changing it to rayproject/ray:2.49.0. Should I consider changing the default Ray image?
@pinkfloyd-eminem
Yes, you can change the default image (both of them worked; bitnami/ray:2.49.0 was the smallest image in size).
All of the bitnami/ray images have been removed and are not available now (most bitnami images are not available). rayproject/ray:2.49.0 is another option for us.
@JesseStutler If I can upload an image to the volcano repository on Docker Hub, I can build a private test image for the e2e test.
@pinkfloyd-eminem The CI issue shall be handled in a separate PR, and I think Wonki4 handles it in
https://github.com/volcano-sh/volcano/pull/4670
So let's wait until it gets into the repository and retrigger the test with [/]retest afterward.
You can then rebase and update the development.yaml with `make update-development-yaml` after that ☺️
Thank you, I will wait for the changes, rebase, and submit again.
@pinkfloyd-eminem #4670 has already been merged, please rebase onto the latest code and fix the code verify CI, thanks
@pinkfloyd-eminem #4670 has already been merged, please rebase onto the latest code and fix the code verify CI, thanks
Hi, I rebased onto the latest master branch and CI passed.
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: JesseStutler
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [JesseStutler]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/cc @hwdef @lowang-bh @hzxuzhonghu @wangyang0616
What is the necessity of using StatefulSet?
@hwdef good comment
I feel the current multi-scheduler support is very premature. @pinkfloyd-eminem Do you need this feature? From my perspective, we need to improve its reliability by allowing schedulers to die or start dynamically without influencing pod scheduling latency.
What is the necessity of using StatefulSet?
Looks like the current implementation of multi-scheduler depends on the StatefulSet-generated stable pod index: https://github.com/volcano-sh/volcano/blob/74976dee5ac08a38ad0b95a81ef88c573bcb3b9e/pkg/scheduler/cache/util.go#L106. We can see that the SCHEDULER_POD_NAME env is taken from metadata.name, and the hash key must be the same in each process.
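To illustrate (assuming the chart's StatefulSet is named volcano-scheduler), the stable identity looks like this:
# StatefulSet pods get stable, ordered names:
#   volcano-scheduler-0, volcano-scheduler-1, ..., volcano-scheduler-<N-1>
# The trailing ordinal gives each process a stable index to hash against.
env:
  - name: SCHEDULER_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name   # e.g. volcano-scheduler-2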
@hwdef good comment
I feel the current multi-scheduler support is very premature. @pinkfloyd-eminem Do you need this feature? From my perspective, we need to improve its reliability by allowing schedulers to die or start dynamically without influencing pod scheduling latency.
The requirement was raised in #4645. /cc @hajnalmt
I feel the current multi-scheduler support is very premature.
I agree with this. I believe that increasing scheduling throughput by deploying multiple schedulers is a systematic and large-scale engineering effort, not just a matter of scaling replicas. We need to consider factors such as node partitioning, scheduling conflicts, and so on.
Thanks for the input @hwdef and @hzxuzhonghu!
You're absolutely right: this feature is still premature. It lacks proper documentation, doesn't handle pod failures reliably, and currently has no test coverage. Metrics might also be misleading in its current state (if not outright unusable), not to mention how it will interact with other features like networkTopologyAware scheduling and plugins like the nodegroup plugin.
That said, I still believe merging this PR is a good step forward, even if the feature remains experimental for now.
At the moment, I'm using node selectors to guide schedulers to specific nodes, as described here: https://github.com/volcano-sh/volcano/blob/master/docs/design/node-selector.md It's a straightforward approach, but I have concerns. The scheduler cache still contains the entire cluster (as it should), so memory usage scales with each deployment. This setup also increases operational overhead. It's worth noting that this feature isn't documented on the official website either; I only found it by chance in the design docs. So I am searching for and experimenting with alternatives.
We also have other experimental features exposed via the Helm chart: https://github.com/volcano-sh/volcano/blob/master/docs/design/multi-scheduler.md I doubt many users are actively using these, but it's good they exist for those who explore deeper. Interestingly, the feature in this PR is the only one I couldn't enable directly via Helm, which is why I raised this request.
I believe that increasing scheduling throughput by deploying multiple schedulers is a systematic and large-scale engineering effort, not just a matter of scaling replicas
If it can be a matter of scaling replicas, why shouldn’t it be? From the minimal testing I tried, both node partitioning and scheduling conflict handling seem to work reasonably well in the current implementation.
I think we should discuss which features in Volcano are considered experimental and find a way to mark them clearly. What do you think @hzxuzhonghu @JesseStutler @hwdef ?