
fix horovod.torch import error

Open ztlevi opened this issue 2 years ago • 0 comments

Description of changes:

I can reproduce the error by running:

nvidia-docker run -it 763104351884.dkr.ecr.us-east-1.amazonaws.com/autogluon-training:0.4.2-gpu-py38-cu112-ubuntu20.04
python3
import horovod.torch

It gives me the following warning:

Extension horovod.torch has not been built: /usr/local/lib/python3.8/dist-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-38-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
Warning! MPI libs are missing, but python applications are still available.

Adding the MPI check during import should fix this.
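
To illustrate the idea, here is a minimal sketch of a guarded import. It is not the exact change in this PR: the helper names `mpi_available` and `import_if` are hypothetical, and treating "an MPI launcher is on PATH" as the MPI check is an assumption made only for this example.

```python
import importlib
import shutil


def mpi_available():
    # Assumption for this sketch: MPI counts as present when an
    # mpirun or mpiexec launcher can be found on PATH.
    return any(shutil.which(cmd) is not None for cmd in ("mpirun", "mpiexec"))


def import_if(module_name, precheck):
    """Import module_name only if precheck() passes.

    Returns the imported module, or None when the precheck fails
    or the import itself raises ImportError.
    """
    if not precheck():
        return None
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None
```

With helpers like these, `import_if("horovod.torch", mpi_available)` would return None instead of attempting the import when MPI is missing, and the caller can fall back to the non-Horovod code path.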

Style and formatting:

I have run pre-commit install && pre-commit run --all-files to ensure that auto-formatting happens with every commit.

Issue number, if available

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

ztlevi, Jul 08 '22 22:07