Open MPI not found in output of mpirun --version.
I ran the following command to test Horovod:
horovodrun -np 4 -H localhost:4 python keras_mnist.py
The following error occurred:
2019-11-15 08:51:09.228813: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
Open MPI not found in output of mpirun --version.
Traceback (most recent call last):
File "/opt/conda/bin/horovodrun", line 21, in
Choose one of:
- Install Open MPI 4.0.0+ and re-install Horovod (use --no-cache-dir pip option).
- Run distributed training script using the standard way provided by your MPI distribution (usually mpirun, srun, or jsrun).
- Use built-in gloo option (horovodrun --gloo ...).
=============================================
!mpirun --version
mpirun.real (OpenRTE) 4.0.1
Report bugs to http://www.open-mpi.org/community/help/
Hi @yuanbw, as a temporary measure you can use the following command, which horovodrun is effectively a shortcut for:
mpirun -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib
In other words, you can run the following. I have tested it and it works.
mpirun -np 4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python keras_mnist.py
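If you want to launch this from a script or notebook rather than typing it out, the same command can be assembled as an argument list. This is only a sketch: `MPI_FLAGS` and `build_mpirun_command` are names made up here for illustration, and the flags are copied verbatim from the command above, not taken from Horovod's source.

```python
# Flags copied from the mpirun command in this thread.
MPI_FLAGS = [
    "-bind-to", "none",
    "-map-by", "slot",
    "-x", "NCCL_DEBUG=INFO",
    "-x", "LD_LIBRARY_PATH",
    "-x", "PATH",
    "-mca", "pml", "ob1",
    "-mca", "btl", "^openib",
]


def build_mpirun_command(num_proc, program):
    """Hypothetical helper: return the mpirun argument list for `program`."""
    return ["mpirun", "-np", str(num_proc)] + MPI_FLAGS + list(program)


command = build_mpirun_command(4, ["python", "keras_mnist.py"])
print(" ".join(command))
```

Passing `command` to `subprocess.run(command)` would start the job without going through a shell, so no argument needs quoting.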
Allow me some time to figure out why the horovodrun shortcut is not working, since the MPI runtime is indeed present and working.
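One thing that may be worth checking in the meantime: the `mpirun --version` output reported above reads `mpirun.real (OpenRTE) 4.0.1` and never contains the literal text "Open MPI". The sketch below tests for that substring. It assumes the horovodrun check is a plain substring match, which the error message suggests but nothing in this thread confirms. `mentions_open_mpi` and `read_mpirun_version` are hypothetical helper names, not Horovod functions.

```python
import shutil
import subprocess


def mentions_open_mpi(version_output):
    """Return True if the text contains the literal string 'Open MPI'."""
    return "Open MPI" in version_output


def read_mpirun_version(executable="mpirun"):
    """Return the output of `<executable> --version`, or '' if it is not on PATH."""
    if shutil.which(executable) is None:
        return ""
    result = subprocess.run(
        [executable, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Output as reported in this thread.
    reported = (
        "mpirun.real (OpenRTE) 4.0.1\n\n"
        "Report bugs to http://www.open-mpi.org/community/help/\n"
    )
    print("reported output mentions Open MPI:", mentions_open_mpi(reported))
    print("local mpirun mentions Open MPI:", mentions_open_mpi(read_mpirun_version()))
```

If the first line prints False while the second prints True on another machine, that would point at the wording of the version banner in this particular install and not at a missing MPI runtime.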
@tlkh Yes, the temporary measure you suggested works. Thanks for your suggestion.