
Open MPI not found in output of mpirun --version.

Open yuanbw opened this issue 6 years ago • 2 comments

I ran the following command to test Horovod:

horovodrun -np 4 -H localhost:4 python keras_mnist.py

The following error occurs:

2019-11-15 08:51:09.228813: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
Open MPI not found in output of mpirun --version.
Traceback (most recent call last):
  File "/opt/conda/bin/horovodrun", line 21, in <module>
    run.run()
  File "/opt/conda/lib/python3.6/site-packages/horovod/run/run.py", line 717, in run
    mpi_run(settings, common_intfs, env)
  File "/opt/conda/lib/python3.6/site-packages/horovod/run/mpi_run.py", line 58, in mpi_run
    'horovodrun convenience script does not find an installed OpenMPI.\n\n'
Exception: horovodrun convenience script does not find an installed OpenMPI.

Choose one of:

  1. Install Open MPI 4.0.0+ and re-install Horovod (use --no-cache-dir pip option).
  2. Run distributed training script using the standard way provided by your MPI distribution (usually mpirun, srun, or jsrun).
  3. Use built-in gloo option (horovodrun --gloo ...).

=============================================

!mpirun --version
mpirun.real (OpenRTE) 4.0.1

Report bugs to http://www.open-mpi.org/community/help/

yuanbw avatar Nov 15 '19 08:11 yuanbw

Hi @yuanbw, as a temporary measure you can use the following command, which horovodrun is effectively a shortcut for:

mpirun -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib

In other words, you can run the following. I have tested it and it works.

mpirun -np 4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python keras_mnist.py
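For launching this from a script rather than typing it by hand, the command above can be assembled as an argument list. This is only a minimal sketch: the helper name build_mpirun_command is hypothetical and not part of Horovod, and the flags are copied verbatim from the command in this comment rather than taken from Horovod's source.

```python
import shlex


def build_mpirun_command(num_procs, script_args):
    """Return the mpirun argument list quoted in this thread.

    build_mpirun_command is a hypothetical helper for illustration;
    the flags are copied from the comment above, nothing is added.
    """
    return [
        "mpirun", "-np", str(num_procs),
        "-bind-to", "none", "-map-by", "slot",
        # Forward these environment variables to the launched processes.
        "-x", "NCCL_DEBUG=INFO", "-x", "LD_LIBRARY_PATH", "-x", "PATH",
        "-mca", "pml", "ob1", "-mca", "btl", "^openib",
        *script_args,
    ]


if __name__ == "__main__":
    cmd = build_mpirun_command(4, ["python", "keras_mnist.py"])
    # shlex.join (Python 3.8+) shell-quotes arguments such as ^openib.
    print(shlex.join(cmd))
    # To actually launch it: subprocess.run(cmd, check=True)
```

Keeping the command as a list avoids shell-quoting problems with arguments such as ^openib when it is passed to subprocess.run.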

Allow me some time to figure out why the horovodrun shortcut is not working, since the MPI runtime is indeed present and working.
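One possible explanation, not confirmed in this thread: the message "Open MPI not found in output of mpirun --version." suggests that horovodrun looks for the text "Open MPI" in that output, while the output reported above starts with "mpirun.real (OpenRTE) 4.0.1" and does not contain that text. The sketch below only illustrates this kind of substring check. The function looks_like_open_mpi and its markers parameter are hypothetical and are not Horovod's actual code.

```python
def looks_like_open_mpi(version_output, markers=("Open MPI",)):
    """Hypothetical check: does any marker string appear in the output?"""
    return any(marker in version_output for marker in markers)


# The `mpirun --version` output as reported in this issue.
reported = (
    "mpirun.real (OpenRTE) 4.0.1\n"
    "\n"
    "Report bugs to http://www.open-mpi.org/community/help/\n"
)

# A check that only looks for "Open MPI" rejects this output...
print(looks_like_open_mpi(reported))  # False
# ...while one that also accepts "OpenRTE" would pass.
print(looks_like_open_mpi(reported, markers=("Open MPI", "OpenRTE")))  # True
```

If this is indeed the cause, the MPI installation itself is fine and only the detection step fails, which would be consistent with the direct mpirun command working.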

tlkh avatar Nov 15 '19 17:11 tlkh

@tlkh Yes, the temporary measure you suggested works. Thanks for your suggestion.

yuanbw avatar Nov 16 '19 01:11 yuanbw