training-operator Support Volcano Scheduler in Kubeflow Trainer

What you would like to be added?

Kubeflow Training Operator V1 supports Volcano for gang scheduling, but Trainer V2 does not support it yet.

Volcano is a widely adopted scheduler for AI workloads. Integrating it into Trainer would give Trainer more AI-specific scheduling capabilities and benefit users who want to schedule pods with Volcano on top of Kubeflow Trainer.

/cc @kubeflow/wg-training-leads @saileshd1402 @astefanutti @juliusvonkohout @franciscojavierarceo @varodrig @rareddy @thesuperzapper @seanlaii @deepanker13 @helenxie-bit @Doris-xm @truc0 @mahdikhashan

Why is this needed?

In #2182, users requested richer Volcano support in Kubeflow Training Operator V1.

AFAIK, kubeedge/sedna is waiting for Volcano support to enable gang scheduling in edge-cloud environments: https://github.com/kubeedge/sedna/issues/463. One of the reasons that work was paused is:

All training workers must have the same parameters: The PyTorchJob CRD in training-operator assumes that all training workers (pods) share the same training parameters, while the FederatedLearningJob CRD in Sedna allows training workers to have different training parameters. So we assume that all training workers have the same training parameters, which will surely put many restrictions on the applicable scenarios of Sedna Federated Learning V2, but we have no choice.

Kubeflow Trainer V2 introduces JobSet as the low-level runtime for distributed training, which allows users to define different training parameters for different training workers. For projects like Sedna, adopting Kubeflow Trainer V2 is therefore a better choice than staying on V1.

For these reasons, supporting Volcano would bring great value to users.

Love this feature?

Give it a 👍 We prioritize the features with the most 👍

Feb 14 '25 12:02 Electronic-Waste

/remove-label lifecycle/needs-triage

Feb 14 '25 12:02 Electronic-Waste

@andreyvelich I noticed that there is no label like area/scheduling for issues. Could you please create one and attach it to this issue? Thanks a lot.

Feb 14 '25 12:02 Electronic-Waste

That's an amazing feature!

Feb 28 '25 08:02 Monokaix

Nice feature! What has changed in Trainer v2 compared to v1?

Feb 28 '25 08:02 JesseStutler

@JesseStutler Hi, thanks for your interest in Kubeflow Trainer.

In v1, we created a separate CRD for each ML framework, such as PyTorchJob, TFJob, and PaddleJob. Users needed to know Kubernetes to fill in the configuration items in each CRD, which is not friendly to data scientists.

In v2, we unify these CRDs into TrainingRuntime and TrainJob. You can think of them as:

TrainingRuntime: A blueprint for training, written by DevOps engineers and ML engineers.
TrainJob: A set of changes applied on top of a TrainingRuntime. Data scientists can change the model, dataset, and other training-related configuration, which lets them iterate on ideas quickly and accelerates AI development.

We also provide an SDK for data scientists to make these changes. They no longer need to know Kubernetes while still leveraging the capabilities it provides.
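
To make the blueprint-and-override relationship concrete, here is a minimal conceptual sketch in plain Python. It is not the Kubeflow Trainer API or SDK: `RuntimeBlueprint`, `JobOverrides`, `resolve`, and all field names and values are hypothetical stand-ins chosen for illustration, and the real CRDs carry far more structure.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class RuntimeBlueprint:
    """Hypothetical stand-in for a TrainingRuntime: defaults owned by platform engineers."""
    image: str
    num_nodes: int
    env: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class JobOverrides:
    """Hypothetical stand-in for a TrainJob: only the fields a data scientist changes."""
    image: Optional[str] = None
    num_nodes: Optional[int] = None
    env: Dict[str, str] = field(default_factory=dict)


def resolve(runtime: RuntimeBlueprint, job: JobOverrides) -> RuntimeBlueprint:
    """Apply the job's overrides on top of the blueprint without modifying the blueprint."""
    return RuntimeBlueprint(
        image=job.image if job.image is not None else runtime.image,
        num_nodes=job.num_nodes if job.num_nodes is not None else runtime.num_nodes,
        # Job-level environment variables win over blueprint defaults on key conflicts.
        env={**runtime.env, **job.env},
    )


# Illustrative values only.
runtime = RuntimeBlueprint(image="example/torch:latest", num_nodes=2, env={"LOG_LEVEL": "info"})
job = JobOverrides(num_nodes=4, env={"DATASET": "demo"})
resolved = resolve(runtime, job)
print(resolved)
```

The point of the split is that the blueprint stays untouched and reusable across many jobs, while each job states only what differs.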

You can find more information in this video and the design doc if you are interested in it.

/cc @kubeflow/wg-training-leads @astefanutti Do you have anything to add?

Feb 28 '25 09:02 Electronic-Waste

/area gsoc

Feb 28 '25 09:02 Electronic-Waste

Oh, that would be great, just like KServe's ServingRuntime. If anything on the Volcano side needs to be adapted, please let us know. We are willing to contribute.

Feb 28 '25 09:02 JesseStutler

Thanks! This issue will be converted to a GSoC project this year. Our communities can bond with each other this summer!

Feb 28 '25 09:02 Electronic-Waste

Hi, I am currently working on MLOps projects and am familiar with Kubernetes. I also know Kubeflow and Volcano, so this topic seems like a good fit for me. I hope to have the opportunity to contribute to the Kubeflow community through it.

Mar 23 '25 07:03 JadeFlute0127

Hi, I'm currently working on an AI workload scheduler. Can you assign this to me?

Mar 27 '25 05:03 rudrakshs-cerebras

@rudrakshs-cerebras This issue was created for a GSoC project, so I probably can't assign it to you. I'm happy to help if you would like to take another issue with the good-first-issue label :)

Mar 27 '25 06:03 Electronic-Waste

I’m interested in participating in GSoC via this project. I have a strong understanding of AI workload scheduling and Kubernetes in general, and I believe I can contribute effectively to this feature.

Mar 27 '25 09:03 rudrakshs-cerebras

@rudrakshs-cerebras In that case, you can make some contributions to Kubeflow and submit your proposal for this project. Following the GSoC guidelines for choosing contributors, we'll review the proposal and take your contributions into consideration when making the final decision. I believe this is fairer to other contributors who also want to participate.

Mar 27 '25 09:03 Electronic-Waste

https://github.com/kubeflow/trainer/issues/1851 Starting with this then :) Thanks, will work on the proposal. @Electronic-Waste

Mar 27 '25 09:03 rudrakshs-cerebras

/assign @Doris-xm

May 29 '25 12:05 Electronic-Waste

Hi @Monokaix @JesseStutler, I'm happy to share that this issue has been chosen as one of the GSoC projects for Kubeflow this year! It's thrilling to see Kubeflow bonding with the Volcano community.

This summer, @Doris-xm will work on this project, and @rudeigerc and I will serve as mentors. Let's move forward!

May 29 '25 12:05 Electronic-Waste

The link to the design doc in the issue comment https://github.com/kubeflow/trainer/issues/2437#issuecomment-2690101139 is broken. I have already found the doc, though.

Jun 12 '25 12:06 anindya-saha

@anindya-saha Thanks for raising this. I've fixed the broken link.

Jun 12 '25 12:06 Electronic-Waste

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Sep 10 '25 15:09 github-actions[bot]

/remove-lifecycle stale

Sep 10 '25 16:09 andreyvelich

This has been completed: https://github.com/kubeflow/website/pull/4213. Great work @Doris-xm 🎉

Oct 16 '25 13:10 andreyvelich