doc: add proposal for integrating Sedna with Volcano.
What type of PR is this? /kind design
What this PR does / why we need it:
This PR contains the proposal for integrating Sedna with Volcano for high-performance task scheduling.
Related to LFX'24 Fall Project: https://github.com/kubeedge/kubeedge/issues/5762
Which issue(s) this PR fixes:
Fixes #
Welcome @Electronic-Waste! It looks like this is your first PR to kubeedge/sedna 🎉
I think I've updated the enhancement proposal to the latest version.
cc @Shelley-BaoYue 👀
/ping @tangming1996 @jaypume
cc @fisherxu
From the current design of Sedna, integrating Volcano may require reworking several of the existing paradigms. Take incremental learning as an example: in the current design, training, evaluation, and inference run sequentially, and there is only one instance of each task, so the role of a PodGroup is not very significant. To make the change worthwhile, the existing incremental-learning training task may need to be converted into a distributed training job so that Volcano's capabilities are fully exploited. Implementing distributed training from scratch in Sedna would be very costly; it would be better to integrate the existing distributed training frameworks in Kubeflow, such as PyTorchJob, TFJob, and PaddleJob. Even so, the workload will be substantial, and a more detailed design is needed before finally deciding how to proceed. I recommend raising this issue with the Sedna community for discussion. The other learning paradigms are in a similar situation.
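To make the PodGroup point concrete, here is a minimal sketch in Python (all names, namespaces, and member counts are illustrative assumptions, not an existing Sedna API): a Volcano PodGroup only enforces anything useful when `minMember > 1`, i.e. when several pods must be gang-scheduled together. With Sedna's current one-pod-per-step incremental learning, `minMember` would be 1 and the PodGroup is effectively a no-op.

```python
def make_podgroup(name: str, min_member: int, namespace: str = "sedna"):
    """Build a Volcano PodGroup manifest (scheduling.volcano.sh/v1beta1).

    minMember is the number of pods that must all be schedulable before
    any pod in the group is allowed to start (gang scheduling).
    """
    return {
        "apiVersion": "scheduling.volcano.sh/v1beta1",
        "kind": "PodGroup",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"minMember": min_member},
    }

# Today: one training pod per incremental-learning round, so gang
# scheduling is moot -- the group is trivially satisfiable.
single = make_podgroup("il-train", min_member=1)

# After a hypothetical move to distributed training (e.g. 1 master +
# 3 workers), the PodGroup actually enforces all-or-nothing placement.
distributed = make_podgroup("il-train-ddp", min_member=4)
```

The gap between `minMember=1` and `minMember=4` is exactly why the comment above argues the training task itself has to become distributed before Volcano pays off.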
/ping @MooreZheng @fisherxu @Shelley-BaoYue
@tangming1996 I see... And maybe LifelongLearningJob is similar to IncrementalLearningJob too. I've added this agenda to the SIG AI Meeting today.
Btw, could you please tell me where I can find the guidance for LifelongLearningJob? I can't find it on the official website, which only covers FederatedLearningJob, IncrementalLearningJob, and JointInferenceService.
you can find it in this document https://github.com/kubeedge/sedna/blob/main/docs/proposals/lifelong-learning/lifelong-learning.md
@tangming1996 Thanks for your guidance!
This proposal seems to warrant a longer discussion. The SIG AI meeting host for Oct - Nov is @hsj576; @Electronic-Waste might want to contact him to reserve a 30-minute time slot in an upcoming routine meeting.
Most Sedna schemes follow on-cloud training plus on-edge inference / distributed inference, and do not directly fit distributed training. It is possible to upgrade lifelong learning with distributed training because it needs to train multiple tasks. But as @Electronic-Waste mentioned, it would not be an easy job to integrate Kubeflow, KubeEdge, Sedna, and Volcano all together.
As far as I am concerned, taking federated learning as a starting point would significantly reduce the workload, because federated learning is, by nature, a distributed training scheme.
@MooreZheng I think it's worth noting that the training-operator has already integrated with Volcano and other high-performance schedulers such as Kueue. So no extra work is needed to integrate Volcano if we use the training-operator as the training runtime for jobs like FederatedLearningJob.
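To illustrate why reusing the training-operator means "no extra work" for Volcano: with gang scheduling enabled in the operator, a job author only has to point the pod template at the Volcano scheduler, and the operator creates the PodGroup itself. A hedged sketch (the image, job name, and replica counts are made-up placeholders; field names follow the `kubeflow.org/v1` PyTorchJob shape):

```python
def make_pytorchjob(name: str, workers: int,
                    image: str = "example.com/fl-train:latest"):
    """Build a Kubeflow PyTorchJob manifest whose pods are handed to Volcano.

    Setting spec.schedulerName on the pod template is the only per-job
    change; PodGroup creation is handled by the training-operator.
    """
    pod_template = {
        "spec": {
            "schedulerName": "volcano",  # route pods to the Volcano scheduler
            "containers": [{"name": "pytorch", "image": image}],
        }
    }
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": {"replicas": 1, "template": pod_template},
                "Worker": {"replicas": workers, "template": pod_template},
            }
        },
    }
```

For a Sedna job like FederatedLearningJob, the controller would translate the job spec into such a manifest instead of creating raw pods, and the Volcano integration comes along for free.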
I agree with your idea about FederatedLearningJob, which is the easiest and most natural scheme to transform into distributed training jobs run by the training-operator. However, integrating other jobs like IncrementalLearningJob, as you mentioned, involves a huge workload; that work will require future involvement from other community members and definitely can't be completed within the LFX'24 term.
In any case, I will revise my draft proposal and present it during the KubeEdge Community Call and SIG AI Meeting next week. We can discuss it in detail then :)
/cc @Shelley-BaoYue @fisherxu @tangming1996 @MooreZheng @hsj576 👀
@Shelley-BaoYue @MooreZheng @tangming1996
I will close this PR since the Volcano plan has been replaced by the training-operator. A new proposal will be raised soon :)