[build] allow MPI on Unix when NCCL is disabled
Description
Fixed the CMake logic so that MPI can be enabled while NCCL is disabled.
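A minimal sketch of the kind of coupling being removed; the option names (`onnxruntime_USE_MPI`, `onnxruntime_USE_NCCL`) follow the repository's naming convention, but the exact original lines may differ from this illustration:

```cmake
# Before (illustrative): MPI was only probed when NCCL was also enabled,
# so configuring with -Donnxruntime_USE_MPI=ON alone had no effect.
#
# if (onnxruntime_USE_NCCL AND onnxruntime_USE_MPI)
#   find_package(MPI)
# endif()

# After (illustrative): MPI is probed whenever it is requested,
# independently of whether NCCL is enabled.
if (onnxruntime_USE_MPI)
  find_package(MPI)
endif()
```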
Motivation and Context
MPI is also used by the CPU backend, not only with CUDA, so it makes sense to properly decouple it from NCCL (which handles communication between multiple Nvidia GPUs).
I thought we no longer use MPI (#17624). Do we?
MPI is not a hard requirement for multi-GPU setups (Nvidia or AMD). Hi @stefantalpalaru, in what case is MPI required for the CPU backend? Is there a real scenario in your case?
In what case is MPI required for the CPU backend?
https://github.com/microsoft/onnxruntime/blob/main/orttraining/orttraining/core/framework/adasum/adasum_mpi.cc
https://github.com/microsoft/onnxruntime/blob/e2abba18ea9370329ce6894a4eb3e98ad8f11cb6/orttraining/orttraining/training_ops/communication_common.h#L107
https://github.com/microsoft/onnxruntime/tree/main/orttraining/orttraining/core/framework/communication/mpi
https://github.com/microsoft/onnxruntime/blob/e2abba18ea9370329ce6894a4eb3e98ad8f11cb6/orttraining/orttraining/python/orttraining_pybind_state.cc#L205
https://github.com/microsoft/onnxruntime/blob/e2abba18ea9370329ce6894a4eb3e98ad8f11cb6/orttraining/orttraining/core/session/training_session.cc#L355
https://github.com/microsoft/onnxruntime/blob/e2abba18ea9370329ce6894a4eb3e98ad8f11cb6/orttraining/orttraining/training_ops/cpu/communication/recv.cc#L3
https://github.com/microsoft/onnxruntime/blob/e2abba18ea9370329ce6894a4eb3e98ad8f11cb6/orttraining/orttraining/training_ops/cpu/cpu_training_kernels.cc#L108
https://github.com/microsoft/onnxruntime/blob/e2abba18ea9370329ce6894a4eb3e98ad8f11cb6/orttraining/orttraining/models/bert/main.cc#L595
https://github.com/microsoft/onnxruntime/blob/e2abba18ea9370329ce6894a4eb3e98ad8f11cb6/orttraining/orttraining/training_ops/cpu/communication/send.h#L3
https://github.com/microsoft/onnxruntime/blob/e2abba18ea9370329ce6894a4eb3e98ad8f11cb6/orttraining/orttraining/models/gpt2/main.cc#L315
Is there a real scenario in your case?
No, I don't need to target the CPU device on my machine.
I was packaging this software for a Gentoo overlay and noticed that USE_MPI does not actually enable MPI, due to what is clearly a logic error in the CMake configuration; hence the fix.
It seems like MPI mostly targets ORT training. Hi @pengwa, do you have any suggestions?
I think MPI was initially used by some legacy training features, and by some POCs for distributed work. Neither is actively serving real user scenarios, but the old code is still there, and we have to keep it until someone decides to remove all of that legacy code.
/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline
/azp run Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed
Azure Pipelines successfully started running 9 pipeline(s).
Azure Pipelines successfully started running 9 pipeline(s).