Migrate v2 MPI operator to the unified operator

Open terrytangyuan opened this issue 3 years ago • 8 comments

Now that the v1 MPI operator has been migrated to this repo (https://github.com/kubeflow/training-operator/pull/1457), let's use this issue to track progress on v2.

https://github.com/kubeflow/mpi-operator/tree/master/v2

cc @hackerboy01 @zw0610 @alculquicondor @kubeflow/wg-training-leads

terrytangyuan avatar Nov 22 '21 15:11 terrytangyuan

@alculquicondor What is the status of MPI Operator v2? Do we have plans to deliver MPI Operator v2 as part of the Universal Training Operator in Kubeflow 1.5? The Kubeflow 1.5 release deadline is January 15th.

andreyvelich avatar Dec 01 '21 18:12 andreyvelich

We need a contributor to do it. I don't currently have the capacity to handle it, which means it likely won't be possible by January 15th. But I don't think the v1 operator is ready either.

alculquicondor avatar Dec 01 '21 18:12 alculquicondor

cc @ArangoGutierrez

terrytangyuan avatar Feb 08 '22 14:02 terrytangyuan

I want to resurrect this thread. There have been many asks from the community to have the v2 MPI operator in the training operator. Currently, newer features are merged into v2 MPI. Time has passed since the last discussion, and the v2 API is stable now. What is our plan here regarding migration? What are the roadblocks? There is also confusion in the community about the future of v1 MPI.

Can we prioritise this? @alculquicondor @terrytangyuan @tenzen-y

johnugeorge avatar Feb 09 '23 10:02 johnugeorge

IIRC, we are planning to donate mpi-operator v2 to kubernetes-sigs. So we should decide whether to donate it to kubernetes-sigs or merge the v2 operator into the training-operator, to avoid managing it in two places.

https://github.com/kubeflow/community/pull/557

cc: @ArangoGutierrez @denkensk @ahg-g

tenzen-y avatar Feb 09 '23 11:02 tenzen-y

Do we have any new plan here? Since the donation of mpi-operator v2 to kubernetes-sigs seems to have been aborted, should we merge mpi-operator v2 into training-operator?

kuizhiqing avatar Aug 28 '23 05:08 kuizhiqing

There's also discussion around donating the Spark-on-K8s project to Kubeflow (no open issue yet since we are still waiting for a governance update). I personally think that project is similar to MPI Operator, which does not focus only on training. So I am not sure whether MPI Operator would be a good fit for training-operator.

terrytangyuan avatar Aug 28 '23 16:08 terrytangyuan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Nov 27 '23 10:11 github-actions[bot]