training-operator IntelMPI support

mpi-operator's MPIJob already gained IntelMPI support in summer 2021. Although training-operator added MPIJob shortly after that, it is still missing IntelMPI support.

Related mpi-operator PRs can be seen from this list: https://github.com/kubeflow/mpi-operator/pulls?q=is%3Apr+intel+mpi+is%3Aclosed

PR https://github.com/kubeflow/training-operator/pull/1804 adds IntelMPI env var support, but other changes are needed as well.

IMHO the most important ones from mpi-operator are:

IntelMPI vs. OpenMPI worker slots format support: https://github.com/kubeflow/mpi-operator/pull/523
Option for selecting which MPI implementation is used: https://github.com/kubeflow/mpi-operator/pull/283
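
To illustrate why the slots format and the implementation option go together, here is a minimal sketch of the hostfile difference. It assumes the usual conventions: Open MPI hostfiles use `host slots=N`, while Intel MPI's Hydra launcher uses `host:N`. The function name, its parameters, and the host names are hypothetical and only for illustration; the implementation values `OpenMPI` and `Intel` are meant to mirror the option added in mpi-operator, and this is not the actual operator code.

```python
def render_hostfile(worker_hosts, slots_per_worker, mpi_implementation="OpenMPI"):
    """Render hostfile contents for the given MPI implementation.

    Hypothetical helper for illustration only; not part of either operator.
    """
    if slots_per_worker < 1:
        raise ValueError("slots_per_worker must be at least 1")

    lines = []
    for host in worker_hosts:
        if mpi_implementation == "OpenMPI":
            # Open MPI hostfile syntax: "<host> slots=<N>"
            lines.append(f"{host} slots={slots_per_worker}")
        elif mpi_implementation == "Intel":
            # Intel MPI (Hydra) hostfile syntax: "<host>:<N>"
            lines.append(f"{host}:{slots_per_worker}")
        else:
            raise ValueError(f"unsupported MPI implementation: {mpi_implementation}")

    # One entry per line, each terminated by a newline.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    hosts = ["job-worker-0", "job-worker-1"]  # example host names (assumed)
    print(render_hostfile(hosts, 2, "OpenMPI"), end="")
    print(render_hostfile(hosts, 2, "Intel"), end="")
```

Since the two formats are not interchangeable, the operator needs to know which implementation the job image uses before it generates the hostfile, which is what the second PR above provides.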

And these few other PRs could also be relevant:

Connection repeat for robustness: https://github.com/kubeflow/mpi-operator/pull/389
Readiness probe when SSH is used: https://github.com/kubeflow/mpi-operator/pull/425
E2E tests robustness: https://github.com/kubeflow/mpi-operator/pull/417
Examples: https://github.com/kubeflow/mpi-operator/pull/419

May 17 '23 17:05 eero-t

Having (eventually) the same API and MPI implementation support for MPIJob as in mpi-operator would make it easier to switch between the two operators and compare them. As I understand it, there are some differences in how they do things, so this would help in getting data on whether those differences actually matter in practice.

May 17 '23 18:05 eero-t

Related: #1804

May 17 '23 20:05 johnugeorge

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Aug 23 '23 20:08 github-actions[bot]

/lifecycle frozen

Aug 24 '23 19:08 johnugeorge
