training-operator
IntelMPI support
mpi-operator's MPIJob gained IntelMPI support back in summer 2021. Although training-operator added MPIJob shortly after that, it is still missing IntelMPI support.
Related mpi-operator PRs can be seen from this list: https://github.com/kubeflow/mpi-operator/pulls?q=is%3Apr+intel+mpi+is%3Aclosed
PR https://github.com/kubeflow/training-operator/pull/1804 adds IntelMPI env var support, but other changes are needed as well.
IMHO the most important ones from mpi-operator are:
- IntelMPI vs. OpenMPI worker slots format support: https://github.com/kubeflow/mpi-operator/pull/523
- Option for selecting which MPI implementation is used: https://github.com/kubeflow/mpi-operator/pull/283
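
To illustrate why the slots format matters: as far as I know, OpenMPI hostfiles use `host slots=N` lines, while IntelMPI expects `host:N`. Below is a minimal Python sketch of that difference, not the operator's actual code. The function names and the worker DNS naming pattern are hypothetical, and the `"OpenMPI"` / `"Intel"` strings are assumed to mirror the implementation option from the second PR.

```python
def hostfile_line(implementation: str, host: str, slots: int) -> str:
    """Format one hostfile entry for the given MPI implementation."""
    if implementation == "OpenMPI":
        # OpenMPI hostfile syntax: "<host> slots=<n>"
        return f"{host} slots={slots}"
    if implementation == "Intel":
        # IntelMPI (Hydra) hostfile syntax: "<host>:<n>"
        return f"{host}:{slots}"
    raise ValueError(f"unsupported MPI implementation: {implementation!r}")


def render_hostfile(implementation: str, job_name: str, namespace: str,
                    workers: int, slots: int) -> str:
    """Render a hostfile with one line per worker pod.

    The DNS naming pattern below is an assumption made for illustration only.
    """
    lines = []
    for i in range(workers):
        host = f"{job_name}-worker-{i}.{job_name}-worker.{namespace}.svc"
        lines.append(hostfile_line(implementation, host, slots))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_hostfile("OpenMPI", "pi", "default", workers=2, slots=1), end="")
    print(render_hostfile("Intel", "pi", "default", workers=2, slots=1), end="")
```

Keeping the format behind a single per-implementation switch means an MPIJob could change MPI implementations without touching the rest of the hostfile generation.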
And these few other PRs could also be relevant:
- Connection repeat for robustness: https://github.com/kubeflow/mpi-operator/pull/389
- Readiness probe when SSH is used: https://github.com/kubeflow/mpi-operator/pull/425
- E2E tests robustness: https://github.com/kubeflow/mpi-operator/pull/417
- Examples: https://github.com/kubeflow/mpi-operator/pull/419
Having (eventually) the same API and MPI implementation support for MPIJob as in mpi-operator would make it easier to switch between the two operators and compare them. As I understand it, they differ somewhat in how they do things, so this would help in gathering data on whether those differences matter in practice.
Related: #1804
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen