training-operator

Support v2 MPIJob

Open chenxi-seu opened this issue 1 year ago • 8 comments

We noticed that the MPIJob used in the Kubeflow community documentation (https://www.kubeflow.org/docs/components/training/mpi/) is the v2beta1 version, but the training-operator seems to support only the v1 version of MPIJob. Does the training-operator community plan to support v2?

Currently, if users need to use both MPIJob and PytorchJob together, they need to install mpi-operator to support v2beta1 MPIJob first, and then install training-operator to use the v1 PytorchJob.

chenxi-seu, Dec 26 '23 11:12

@chenxi-seu Yes, we have a plan to support MPIJob v2. Please see https://github.com/kubeflow/training-operator/issues/1906.

tenzen-y, Dec 26 '23 11:12

@tenzen-y Thank you for your response. Could you please confirm if my current approach is correct? I plan to first install mpi-operator and then install training-operator by configuring --enable-scheme. I only enable PytorchJob to avoid any conflict between the two operators regarding different versions of MPIJob.

chenxi-seu, Dec 26 '23 12:12

> @tenzen-y Thank you for your response. Could you please confirm if my current approach is correct? I plan to first install mpi-operator and then install training-operator by configuring --enable-scheme. I only enable PytorchJob to avoid any conflict between the two operators regarding different versions of MPIJob.

Yes, you're right.

tenzen-y, Dec 26 '23 12:12
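
For readers following along, the approach confirmed above might look like the fragment below in the training-operator Deployment. This is only a sketch, not the project's official manifest: it assumes the manager container takes `--enable-scheme` flags in the form shown later in this thread, and that leaving out `mpijob` hands MPIJob over to the separately installed mpi-operator.

```yaml
# Fragment of the training-operator Deployment pod template (sketch only).
spec:
  containers:
  - command:
    - /manager
    args:
    # Enable only PyTorchJob; MPIJob (v2beta1) is left to the separately
    # installed mpi-operator, so the two operators do not overlap.
    - --enable-scheme=pytorchjob
```

Additional `--enable-scheme` entries can be listed for other job kinds that are needed, as long as `mpijob` stays out of the list.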

> @tenzen-y Thank you for your response. Could you please confirm if my current approach is correct? I plan to first install mpi-operator and then install training-operator by configuring --enable-scheme. I only enable PytorchJob to avoid any conflict between the two operators regarding different versions of MPIJob.

Sorry, I have the same problem now, but I don't understand how you did this step.

mupeifeiyi, Mar 25 '24 10:03

> > @tenzen-y Thank you for your response. Could you please confirm if my current approach is correct? I plan to first install mpi-operator and then install training-operator by configuring --enable-scheme. I only enable PytorchJob to avoid any conflict between the two operators regarding different versions of MPIJob.
>
> Sorry, I have the same problem now, but I don't understand how you did this step.

@mupeifeiyi This would be a good example: https://github.com/kubeflow/training-operator/issues/1777#issuecomment-1480720233

tenzen-y, Mar 25 '24 18:03

> > > @tenzen-y Thank you for your response. Could you please confirm if my current approach is correct? I plan to first install mpi-operator and then install training-operator by configuring --enable-scheme. I only enable PytorchJob to avoid any conflict between the two operators regarding different versions of MPIJob.
> >
> > Sorry, I have the same problem now, but I don't understand how you did this step.
>
> @mupeifeiyi This would be a good example: https://github.com/kubeflow/training-operator/issues/1777#issuecomment-1480720233

Thanks, the following is useful to me:

    spec:
      containers:
      - args:
        - --enable-scheme=tfjob
        - --enable-scheme=pytorchjob
        - --enable-scheme=mxjob
        - --enable-scheme=xgboostjob
        - --enable-scheme=paddlejob
        command:
        - /manager

Note that the scheme name is mxjob, not mxnetjob.

mupeifeiyi, Mar 28 '24 05:03
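
One possible way to apply the args above without editing the upstream manifest by hand is a kustomize JSON patch. The sketch below rests on several assumptions that should be checked against the manifests you actually install: the upstream overlay path and `ref`, the Deployment name `training-operator`, and the manager being the first container (index 0).

```yaml
# kustomization.yaml (sketch; resource URL, ref, and names are assumptions)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
# Assumed upstream overlay and release tag; substitute the ones you install.
- github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0
patches:
# JSON Patch "add" on an existing object member replaces it, so this sets
# the args whether or not the base Deployment already defines them.
- target:
    kind: Deployment
    name: training-operator   # assumed Deployment name
  patch: |-
    - op: add
      path: /spec/template/spec/containers/0/args
      value:
      - --enable-scheme=tfjob
      - --enable-scheme=pytorchjob
      - --enable-scheme=mxjob
      - --enable-scheme=xgboostjob
      - --enable-scheme=paddlejob
```

With mpijob left out of the list, the MPIJob v2beta1 resources stay under the control of the separately installed mpi-operator.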

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot], Jun 26 '24 10:06

For context, we are planning to implement support for MPIJob V2 as part of the Kubeflow Training V2 proposal: https://bit.ly/3WzjTlw

andreyvelich, Jun 26 '24 10:06

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot], Sep 24 '24 20:09

@andreyvelich Should we keep this open?

terrytangyuan, Sep 24 '24 23:09

@terrytangyuan We track the V2 migration for MPI jobs as part of this issue: https://github.com/kubeflow/training-operator/issues/2217. In the Kubeflow V2 API, we will support the second version of the MPI operator.

andreyvelich, Sep 25 '24 15:09

Got it. Thanks

terrytangyuan, Sep 26 '24 02:09