[build] allow MPI on Unix when NCCL is disabled

Open · stefantalpalaru opened this pull request on Jun 25, 2024 · 4 comments

Description

Fixes the CMake logic to allow enabling MPI while NCCL is disabled.

Motivation and Context

MPI is also used by the CPU backend, not only with CUDA, so it makes sense to properly decouple it from NCCL (which deals with multiple Nvidia GPUs).
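
For context, the goal is that MPI detection stands on its own, so that a configure like `cmake -Donnxruntime_USE_MPI=ON -Donnxruntime_USE_NCCL=OFF ...` actually takes effect. Below is a minimal sketch of the decoupled logic; the `onnxruntime_USE_MPI`/`onnxruntime_USE_NCCL` option names and the `USE_MPI` macro match the repository, but the surrounding structure is illustrative, not the actual patch:

```cmake
# Sketch only: MPI and NCCL treated as independent options.
option(onnxruntime_USE_MPI "Build with MPI support" OFF)
option(onnxruntime_USE_NCCL "Build with NCCL support" OFF)

if (onnxruntime_USE_MPI)
  # MPI is useful on its own (e.g. for the CPU training ops), so probe
  # for it regardless of whether NCCL is enabled.
  find_package(MPI QUIET)
  if (MPI_CXX_FOUND)
    # The training code guards its MPI paths with #ifdef USE_MPI.
    add_compile_definitions(USE_MPI=1)
  else()
    message(WARNING "onnxruntime_USE_MPI is ON but no MPI installation was found")
  endif()
endif()
```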

stefantalpalaru commented Jun 25 '24 23:06

I thought we no longer use MPI (#17624). Do we?

snnn commented Jun 26 '24 02:06

MPI is not a hard requirement for multi-GPU setups (Nvidia or AMD). Hi @stefantalpalaru, in what case is MPI required for the CPU backend? Is there a real scenario in your case?

wejoncy commented Jun 26 '24 03:06

In what case is MPI required for the CPU backend?

https://github.com/microsoft/onnxruntime/blob/main/orttraining/orttraining/core/framework/adasum/adasum_mpi.cc

https://github.com/microsoft/onnxruntime/blob/e2abba18ea9370329ce6894a4eb3e98ad8f11cb6/orttraining/orttraining/training_ops/communication_common.h#L107

https://github.com/microsoft/onnxruntime/tree/main/orttraining/orttraining/core/framework/communication/mpi

https://github.com/microsoft/onnxruntime/blob/e2abba18ea9370329ce6894a4eb3e98ad8f11cb6/orttraining/orttraining/python/orttraining_pybind_state.cc#L205

https://github.com/microsoft/onnxruntime/blob/e2abba18ea9370329ce6894a4eb3e98ad8f11cb6/orttraining/orttraining/core/session/training_session.cc#L355

https://github.com/microsoft/onnxruntime/blob/e2abba18ea9370329ce6894a4eb3e98ad8f11cb6/orttraining/orttraining/training_ops/cpu/communication/recv.cc#L3

https://github.com/microsoft/onnxruntime/blob/e2abba18ea9370329ce6894a4eb3e98ad8f11cb6/orttraining/orttraining/training_ops/cpu/cpu_training_kernels.cc#L108

https://github.com/microsoft/onnxruntime/blob/e2abba18ea9370329ce6894a4eb3e98ad8f11cb6/orttraining/orttraining/models/bert/main.cc#L595

https://github.com/microsoft/onnxruntime/blob/e2abba18ea9370329ce6894a4eb3e98ad8f11cb6/orttraining/orttraining/training_ops/cpu/communication/send.h#L3

https://github.com/microsoft/onnxruntime/blob/e2abba18ea9370329ce6894a4eb3e98ad8f11cb6/orttraining/orttraining/models/gpt2/main.cc#L315

Is there a real scenario in your case?

No, I don't need to target the CPU device on my machine.

I was packaging this software for a Gentoo overlay and noticed that USE_MPI does not enable MPI, due to what is clearly a logic error in the CMake configuration, hence the fix.
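
For illustration, the bug pattern is an MPI probe that is only reachable when NCCL is enabled, so `-Donnxruntime_USE_MPI=ON` alone is a silent no-op. A sketch of the pattern and the fix (not the repository's exact condition):

```cmake
# Buggy pattern (illustrative): the MPI probe is nested inside the NCCL
# branch, so enabling MPI without NCCL does nothing.
if (onnxruntime_USE_NCCL)
  if (onnxruntime_USE_MPI)
    find_package(MPI QUIET)
  endif()
endif()

# Fixed pattern: probe for MPI independently of NCCL.
if (onnxruntime_USE_MPI)
  find_package(MPI QUIET)
endif()
```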

stefantalpalaru commented Jun 26 '24 07:06

I was packaging this software for a Gentoo overlay and noticed that USE_MPI does not enable MPI, due to what is clearly a logic error in the CMake configuration, hence the fix.

It seems like MPI mostly targets ORT training. Hi @pengwa, do you have any suggestions?

wejoncy commented Jun 26 '24 13:06

I think MPI was initially used by some legacy training features and by some POCs for distributed work. Neither actively serves real user scenarios, but the old code is still there, and we have to keep it until someone decides to remove all that legacy code.

pengwa commented Jul 04 '24 06:07

/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline

snnn commented Jul 08 '24 20:07

/azp run Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

snnn commented Jul 08 '24 20:07

Azure Pipelines successfully started running 9 pipeline(s).

azure-pipelines[bot] commented Jul 08 '24 20:07

Azure Pipelines successfully started running 9 pipeline(s).

azure-pipelines[bot] commented Jul 08 '24 20:07