Travis Addair
Hey @linjiaqin, the error message is coming from `horovodrun`, so it will only test on the machine you're running it on. It looks like Horovod did not install correctly. Can...
Hi @XueCangQiuYe, can you try `pip uninstall horovod` and then rerun `HOROVOD_WITH_MPI=1 HOROVOD_WITH_PYTORCH=1 pip install --no-cache-dir horovod`? It looks like Horovod was not reinstalled because it was already installed.
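If it helps, here is a minimal sketch for checking whether a previous install is still present before reinstalling. It uses only the standard library; the helper name `installed_version` is made up for illustration and is not part of Horovod or pip.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent.

    `installed_version` is a hypothetical helper for this example only.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # No distribution with this name is installed in the environment.
        return None


if __name__ == "__main__":
    version = installed_version("horovod")
    if version is None:
        print("horovod is not installed")
    else:
        print(f"horovod {version} is already installed")
```

If this reports an existing version, pip will likely treat the requirement as already satisfied, which is why uninstalling first is suggested.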
Hey @JohnTaylor2000, is the `IndexError` coming from `gpus[hvd.local_rank()]`? Can you try checking the length of that list on each node? ``` import socket gpus = tf.config.experimental.list_physical_devices('GPU') if hvd.local_rank()...
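A rough sketch of that kind of per-node check, separate from the truncated snippet above: an `IndexError` from `gpus[hvd.local_rank()]` means the local rank is not smaller than the number of visible GPUs, so the helper below flags exactly that case. The function name `gpu_visibility_report` is hypothetical, and the Horovod/TensorFlow usage is shown only in comments so the block runs without either library.

```python
import socket


def gpu_visibility_report(local_rank, num_visible_gpus, hostname=None):
    """Return a one-line diagnostic for the calling process.

    Hypothetical helper: flags the condition under which
    gpus[local_rank] would raise IndexError.
    """
    hostname = hostname or socket.gethostname()
    in_range = 0 <= local_rank < num_visible_gpus
    status = "ok" if in_range else "local_rank out of range"
    return (
        f"{hostname}: local_rank={local_rank}, "
        f"visible_gpus={num_visible_gpus} ({status})"
    )


# Assumed usage inside a Horovod + TensorFlow training script:
#   import horovod.tensorflow as hvd
#   import tensorflow as tf
#   hvd.init()
#   gpus = tf.config.experimental.list_physical_devices('GPU')
#   print(gpu_visibility_report(hvd.local_rank(), len(gpus)))
```

Printing this from every process should show which host exposes fewer GPUs than the number of processes launched on it.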
Hey @asawanggaa, did you try running with `horovodrun --gloo ...`?
Thanks for reporting @MichaelLtv. @irasit, can you take a look?
@cliffwoolley I'm seeing this issue pop up with the version of NCCL shipped in the `nvidia/cuda:11.6.1-cudnn8-devel-ubuntu18.04` image on Docker Hub. Is this a known issue with this image?
Hey @whatdhack @ft3020997, what versions of CMake are you using?
Hey @ft3020997, thanks for the update! So it seems that the MPI package that ships with Conda by default does not contain compilers (https://github.com/conda-forge/openmpi-feedstock/issues/34). What package did you use to install...
@whatdhack, our CI system has coverage for oneCCL and MPICH. Where are these versions installed? It may be that we need to provide a way to hardcode the path.
Hey @jarednielsen, Adasum does not yet support TF 2.X, though that work is in progress. @Tixxx might be able to tell you more.