
Horovod installation default settings cause environment problems

Open Mert-Ergin opened this issue 4 years ago • 3 comments

I am running experiments with horovod with OpenMPI. Horovod in my environment is built with the following command:

PATH=$PATH:$HOME/openmpi/bin LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/openmpi/lib HOROVOD_NCCL_HOME=/usr/lib/x86_64-linux-gnu HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir horovod

However, when trains-agent creates a copy of this environment, I assume it only runs pip install horovod.

In the pip-only environment I then get the following error:

    import horovod.keras as hvd
  File "/home/mert/.trains/venvs-builds-alsi/3.6/lib/python3.6/site-packages/horovod/keras/__init__.py", line 19, in <module>
    from horovod.tensorflow import init
  File "/home/mert/.trains/venvs-builds-alsi/3.6/lib/python3.6/site-packages/horovod/tensorflow/__init__.py", line 28, in <module>
    from horovod.tensorflow.mpi_ops import allgather, broadcast, _allreduce
  File "/home/mert/.trains/venvs-builds-alsi/3.6/lib/python3.6/site-packages/horovod/tensorflow/mpi_ops.py", line 49, in <module>
    MPI_LIB = _load_library('mpi_lib' + get_ext_suffix())
  File "/home/mert/.trains/venvs-builds-alsi/3.6/lib/python3.6/site-packages/horovod/tensorflow/mpi_ops.py", line 45, in _load_library
    library = load_library.load_op_library(filename)
  File "/home/mert/.trains/venvs-builds-alsi/3.6/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /home/mert/.trains/venvs-builds-alsi/3.6/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZNK10tensorflow8OpKernel4nameB5cxx11Ev

You can find this issue numerous times in horovod issues: https://github.com/horovod/horovod/issues/236 https://github.com/horovod/horovod/issues/431 https://github.com/horovod/horovod/issues/656

After I get this error, I go into the environment and check the horovod build using horovodrun --check-build; it is not built properly. If I run the following two lines, I get a successful build:

pip uninstall horovod -y
PATH=$PATH:$HOME/openmpi/bin LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/openmpi/lib HOROVOD_NCCL_HOME=/usr/lib/x86_64-linux-gnu HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir horovod

However, this is automatically deleted whenever trains-agent starts a new experiment.

Is there any solution for this like you did for --extra-index-url?

Mert-Ergin, Jun 18 '20 11:06

Hi @Mert-Ergin

Yes, horovod is one of the special cases in trains-agent. Like git-based pip installs, horovod is installed last, after all the other packages are installed. The reason is that horovod's installation compiles against the packages already installed in the system.

For example, if we have the following packages: numpy, tensorflow==1.13, keras==2.2.4, horovod, then trains-agent will first run pip install numpy tensorflow==1.13 keras==2.2.4 and, after those are installed, pip install horovod.
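
The ordering described above can be sketched in a few lines of Python. This is only an illustration of the idea, not trains-agent's actual implementation; the names DEFERRED, package_name, split_install_order and pip_commands are hypothetical.

```python
from typing import List, Tuple

# Hypothetical set of packages that should be installed after everything else,
# because their build depends on what is already present (e.g. TensorFlow).
DEFERRED = {"horovod"}


def package_name(requirement: str) -> str:
    """Return the bare, lower-cased package name of a requirement line."""
    name = requirement.strip()
    for sep in ("==", ">=", "<=", "~=", "!=", ">", "<", "["):
        name = name.split(sep, 1)[0]
    return name.strip().lower()


def split_install_order(requirements: List[str]) -> Tuple[List[str], List[str]]:
    """Split requirements into (install first, install last)."""
    first = [r for r in requirements if package_name(r) not in DEFERRED]
    last = [r for r in requirements if package_name(r) in DEFERRED]
    return first, last


def pip_commands(requirements: List[str]) -> List[List[str]]:
    """Build the pip invocations in the order they would be run."""
    first, last = split_install_order(requirements)
    commands = []
    if first:
        commands.append(["pip", "install", *first])
    if last:
        commands.append(["pip", "install", *last])
    return commands


if __name__ == "__main__":
    reqs = ["numpy", "tensorflow==1.13", "keras==2.2.4", "horovod"]
    for cmd in pip_commands(reqs):
        print(" ".join(cmd))
```

The second command runs in a separate step, so any HOROVOD_* variables present in the process environment at that point are what the horovod build sees.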

A few details to help me:

  1. If you manually install based on these two steps (obviously with the correct packages/versions), do you get a working horovod install?
  2. Are you using conda? Did you try a compiled horovod (see list of channels)? You can add additional conda channels if you need to (see here).

bmartinn, Jun 18 '20 21:06

Here are the answers:

  1. Yes, with a minor tweak. Instead of pip install horovod I used PATH=$PATH:$HOME/openmpi/bin LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/openmpi/lib HOROVOD_NCCL_HOME=/usr/lib/x86_64-linux-gnu HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir horovod, as recommended for OpenMPI installs. If I use plain pip install horovod, it installs without errors, but I cannot import it in Python: when I run python -c 'import horovod.keras as hvd' I get an error. So I install using the command above.

  2. For the error mentioned above, no; that environment contains only PyPI packages. However, I also have a conda environment, and it suffers from the same issue with a different error: tensorflow.python.framework.errors_impl.NotFoundError: libtensorflow_framework.so: cannot open shared object file: No such file or directory. I have also tried compiled horovod; the two most downloaded packages did not work.

Mert-Ergin, Jun 19 '20 08:06

Hi @Mert-Ergin

If you are running trains-agent in docker mode, the easiest option is to build a Docker image with horovod (or take one of the pre-built ones; they exist for both TF and PyTorch) and use it as the "Base Docker Image". That way, if horovod is in the requirements, it will not be reinstalled because it is already installed in the image. (If you are uncertain of the horovod version in the image, edit the "Installed Packages" and remove the version section, i.e. horovod==x.y.y -> horovod.)

If you are running in virtual-environment mode, you can pass the environment variables to trains-agent, which will in turn pass them on to the installation, so you can simply run:
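
As a rough illustration of the horovod==x.y.y -> horovod edit, the sketch below removes the version specifier from one package in a requirements-style text. The function name unpin and the version numbers in the example are made up for illustration; in practice the edit is done by hand in the "Installed Packages" section.

```python
import re


def unpin(installed_packages: str, package: str) -> str:
    """Drop the version specifier from `package`, leaving other lines untouched."""
    pattern = re.compile(
        r"^(\s*" + re.escape(package) + r")\s*(?:==|>=|<=|~=|!=|>|<)[^\n#;]*",
        flags=re.IGNORECASE | re.MULTILINE,
    )
    return pattern.sub(r"\1", installed_packages)


if __name__ == "__main__":
    # Illustrative package list; the versions are placeholders.
    packages = "numpy==1.16.4\nhorovod==0.19.0\ntensorflow==1.13.1"
    print(unpin(packages, "horovod"))
```

With the pin removed, whatever horovod version is already present in the image satisfies the requirement.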

export PATH=$PATH:$HOME/openmpi/bin LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/openmpi/lib 
export HOROVOD_NCCL_HOME=/usr/lib/x86_64-linux-gnu 
export HOROVOD_GPU_ALLREDUCE=NCCL 
export HOROVOD_WITH_TENSORFLOW=1
export HOROVOD_WITHOUT_PYTORCH=1 
export HOROVOD_WITHOUT_MXNET=1 
trains-agent daemon --queue default --gpus all --detached

** Note that you can of course put the environment variables on a single line in front of the command; I used export so it is easier to read :) For example: HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 trains-agent ...
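
The reason exporting the variables works is ordinary process-environment inheritance: a child process sees the environment of the process that launched it. The sketch below demonstrates only that general mechanism with a throwaway Python child process. It does not show trains-agent's internals, and the helper names child_env and read_in_child are hypothetical.

```python
import os
import subprocess
import sys

# Build-time settings taken from the thread; the values are specific to that machine.
HOROVOD_BUILD_ENV = {
    "HOROVOD_GPU_ALLREDUCE": "NCCL",
    "HOROVOD_WITH_TENSORFLOW": "1",
    "HOROVOD_WITHOUT_PYTORCH": "1",
    "HOROVOD_WITHOUT_MXNET": "1",
}


def child_env(extra: dict) -> dict:
    """Copy the current environment and add the extra variables to it."""
    env = os.environ.copy()
    env.update(extra)
    return env


def read_in_child(name: str, env: dict) -> str:
    """Start a child Python process and return the value it sees for `name`."""
    result = subprocess.run(
        [sys.executable, "-c", f"import os; print(os.environ.get({name!r}, ''))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    env = child_env(HOROVOD_BUILD_ENV)
    print(read_in_child("HOROVOD_GPU_ALLREDUCE", env))
```

A pip install started from a process with these variables set would see them in the same way, which is what the horovod build relies on.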

bmartinn, Jun 19 '20 13:06