
IntelMPI support

Open eero-t opened this issue 2 years ago • 4 comments

mpi-operator's MPIJob gained IntelMPI support back in summer 2021. Although training-operator added MPIJob shortly after that, it is still missing IntelMPI support.

Related mpi-operator PRs can be seen from this list: https://github.com/kubeflow/mpi-operator/pulls?q=is%3Apr+intel+mpi+is%3Aclosed

PR https://github.com/kubeflow/training-operator/pull/1804 adds IntelMPI env var support, but other changes are needed as well.

IMHO the most important ones from mpi-operator are:

  • IntelMPI vs. OpenMPI worker slots format support: https://github.com/kubeflow/mpi-operator/pull/523
  • Option for selecting which MPI implementation is used: https://github.com/kubeflow/mpi-operator/pull/283
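
To illustrate why the two items above go together: Open MPI hostfiles declare slots as `host slots=N`, while Intel MPI's Hydra launcher expects `host:N`, so the operator has to know which implementation is in use before it can write the hostfile. Below is a minimal Python sketch of that selection. It is not the operator's actual code (which is written in Go); the function name, the `"OpenMPI"`/`"Intel"` option values, and the host names are hypothetical.

```python
from typing import List


def render_hostfile(worker_hosts: List[str], slots_per_worker: int,
                    mpi_implementation: str) -> str:
    """Render a hostfile for the given MPI implementation.

    Hypothetical helper for illustration only; the option values
    "OpenMPI" and "Intel" are assumptions, not a confirmed API.
    """
    impl = mpi_implementation.lower()
    if impl == "openmpi":
        # Open MPI hostfile syntax: "<host> slots=<n>"
        template = "{host} slots={slots}"
    elif impl == "intel":
        # Intel MPI (Hydra) hostfile syntax: "<host>:<n>"
        template = "{host}:{slots}"
    else:
        raise ValueError(f"unsupported MPI implementation: {mpi_implementation}")

    # One line per worker, each terminated by a newline.
    return "".join(
        template.format(host=host, slots=slots_per_worker) + "\n"
        for host in worker_hosts
    )


if __name__ == "__main__":
    workers = ["job-worker-0", "job-worker-1"]  # hypothetical host names
    print(render_hostfile(workers, 2, "OpenMPI"), end="")
    print(render_hostfile(workers, 2, "Intel"), end="")
```

Rejecting unknown implementation values, rather than silently falling back to one format, avoids a hostfile that the launcher would misparse.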

And these few other PRs could also be relevant:

  • Connection repeat for robustness: https://github.com/kubeflow/mpi-operator/pull/389
  • Readiness probe when SSH is used: https://github.com/kubeflow/mpi-operator/pull/425
  • E2E tests robustness: https://github.com/kubeflow/mpi-operator/pull/417
  • Examples: https://github.com/kubeflow/mpi-operator/pull/419

eero-t avatar May 17 '23 17:05 eero-t

Having (eventually) the same API and MPI implementation support for MPIJob as in mpi-operator would make it easier to switch between the two operators and compare them. As I understand it, they differ somewhat in how they do things, so this would help in gathering data on whether those differences matter in practice.

eero-t avatar May 17 '23 18:05 eero-t

Related: #1804

johnugeorge avatar May 17 '23 20:05 johnugeorge

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 23 '23 20:08 github-actions[bot]

/lifecycle frozen

johnugeorge avatar Aug 24 '23 19:08 johnugeorge