clearml-agent
Horovod installation default settings cause environment problems
I am running experiments with horovod with OpenMPI. Horovod in my environment is built with the following command:
PATH=$PATH:$HOME/openmpi/bin LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/openmpi/lib HOROVOD_NCCL_HOME=/usr/lib/x86_64-linux-gnu HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir horovod
However, when trains-agent tries to create a copy of this environment I 'assume' it only runs
pip install horovod
I get the following error in the pip-only environment:
import horovod.keras as hvd
File "/home/mert/.trains/venvs-builds-alsi/3.6/lib/python3.6/site-packages/horovod/keras/__init__.py", line 19, in
from horovod.tensorflow import init
File "/home/mert/.trains/venvs-builds-alsi/3.6/lib/python3.6/site-packages/horovod/tensorflow/__init__.py", line 28, in
from horovod.tensorflow.mpi_ops import allgather, broadcast, _allreduce
File "/home/mert/.trains/venvs-builds-alsi/3.6/lib/python3.6/site-packages/horovod/tensorflow/mpi_ops.py", line 49, in
MPI_LIB = _load_library('mpi_lib' + get_ext_suffix())
File "/home/mert/.trains/venvs-builds-alsi/3.6/lib/python3.6/site-packages/horovod/tensorflow/mpi_ops.py", line 45, in _load_library
library = load_library.load_op_library(filename)
File "/home/mert/.trains/venvs-builds-alsi/3.6/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /home/mert/.trains/venvs-builds-alsi/3.6/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZNK10tensorflow8OpKernel4nameB5cxx11Ev
You can find this issue numerous times in horovod issues: https://github.com/horovod/horovod/issues/236 https://github.com/horovod/horovod/issues/431 https://github.com/horovod/horovod/issues/656
After I get this error, I go to the environment, and check the horovod build using
horovodrun --check-build
and it is not built properly. If I run the following two lines, I get a successful build:
pip uninstall horovod -y
PATH=$PATH:$HOME/openmpi/bin LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/openmpi/lib HOROVOD_NCCL_HOME=/usr/lib/x86_64-linux-gnu HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir horovod
However, this build is automatically deleted when trains-agent tries to run a new experiment.
Is there any solution for this like you did for --extra-index-url?
Hi @Mert-Ergin
Yes, horovod is one of the special cases in trains-agent.
Like git-based pip installs, horovod will be installed last, meaning after all the other packages are installed. The reason is of course the way horovod installs (compiles) based on the packages already installed in the system.
For example if we have the following packages:
numpy
tensorflow==1.13
keras==2.2.4
horovod
trains-agent will first run
pip install numpy tensorflow==1.13 keras==2.2.4
and after those are installed
pip install horovod
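The ordering described above can be sketched roughly as follows. This is only an illustration, not trains-agent's actual code: the function name plan_install is hypothetical, and it just prints the two pip commands instead of running them.

```shell
# Hypothetical sketch: split a requirements list so that horovod is installed
# last, after everything it compiles against is already in place.
plan_install() {
  first=""
  last=""
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      "") ;;                               # skip empty lines
      horovod*) last="$last $line" ;;      # defer horovod to the second phase
      *) first="$first $line" ;;           # everything else goes first
    esac
  done < "$1"
  if [ -n "$first" ]; then echo "pip install$first"; fi
  if [ -n "$last" ]; then echo "pip install$last"; fi
}

# Example with the package list from above.
reqs=$(mktemp)
printf 'numpy\ntensorflow==1.13\nkeras==2.2.4\nhorovod\n' > "$reqs"
plan_install "$reqs"
rm -f "$reqs"
```

For the list above this prints the same two commands, with horovod in the second one.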
A few details to help me:
Here are the answers:
-
Yes, with a minor tweak. Instead of
pip install horovod
I used
PATH=$PATH:$HOME/openmpi/bin LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/openmpi/lib HOROVOD_NCCL_HOME=/usr/lib/x86_64-linux-gnu HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir horovod
as recommended for OpenMPI installs. If I use pip install horovod, it installs without errors, but I cannot import it in Python, so I install using the command mentioned. If I write
python -c 'import horovod.keras as hvd'
I get an error. Or if I use
-
For the error mentioned above, no. It is only pypi packages. However, I also have a conda environment, but it suffers from the same issue with a different error:
tensorflow.python.framework.errors_impl.NotFoundError: libtensorflow_framework.so: cannot open shared object file: No such file or directory
I have also tried compiled horovod packages; the two most downloaded ones did not work.
Hi @Mert-Ergin
If you are running trains-agent in docker mode, the easiest option is to build a docker image with horovod (or take one of the pre-built ones, they have them for both TF and PyTorch), and use that as the "Base Docker Image". That means that if horovod is in the requirements, it will not get reinstalled, because it is already installed in the docker image. If you are uncertain of the horovod version in the docker image, edit the "Installed Packages" and remove the version section, i.e. horovod==x.y.y -> horovod.
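As a small illustration of the version-pin edit described above, here is a sketch that strips the pin from a horovod line in a plain-text package list. The helper name unpin_horovod and the version 1.2.3 are made up for the example; in practice the edit is done by hand in the "Installed Packages" section of the experiment.

```shell
# Hypothetical helper: read a package list on stdin and drop the version pin
# from the horovod line, leaving every other requirement untouched.
unpin_horovod() {
  sed 's/^horovod==.*$/horovod/'
}

# Example input with a placeholder horovod version.
printf 'numpy\nhorovod==1.2.3\nkeras==2.2.4\n' | unpin_horovod
```

Only the horovod line changes; the pins on the other packages are kept.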
If you are running in virtual-environment mode, you can pass the environment variables to trains-agent, which will in turn pass them to the installation, so you can simply run:
export PATH=$PATH:$HOME/openmpi/bin LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/openmpi/lib
export HOROVOD_NCCL_HOME=/usr/lib/x86_64-linux-gnu
export HOROVOD_GPU_ALLREDUCE=NCCL
export HOROVOD_WITH_TENSORFLOW=1
export HOROVOD_WITHOUT_PYTORCH=1
export HOROVOD_WITHOUT_MXNET=1
trains-agent daemon --queue default --gpus all --detached
** Notice that you can obviously concatenate the environment variables to a single line, I chose to use export so it is easier to read :)
For example:
HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 trains-agent ...
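To see why both forms work, here is a minimal sketch of how a child process inherits environment variables from the shell that starts it. In this sketch sh -c merely stands in for trains-agent; that the agent forwards its environment to the pip install step is taken from the reply above and is not demonstrated here.

```shell
# Start from a clean state so the demo does not depend on the caller's shell.
unset HOROVOD_GPU_ALLREDUCE HOROVOD_WITH_TENSORFLOW

# Form 1: export the variable, then start the child process.
export HOROVOD_GPU_ALLREDUCE=NCCL
exported=$(sh -c 'printf %s "$HOROVOD_GPU_ALLREDUCE"')

# Form 2: prefix the assignment to a single command.
prefixed=$(HOROVOD_WITH_TENSORFLOW=1 sh -c 'printf %s "$HOROVOD_WITH_TENSORFLOW"')

# The prefixed variable applies to that one command only and is not left
# behind in the calling shell.
leftover=${HOROVOD_WITH_TENSORFLOW:-unset}

echo "export form, child sees: $exported"
echo "prefix form, child sees: $prefixed"
echo "prefix form, parent keeps: $leftover"
```

The practical difference is scope: exported variables stay set for everything started later from the same shell, while prefixed ones apply to a single command.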