`cupy` wheel and `torch` wheel link to different NCCL shared libraries in RAPIDS CI containers
We have hit the same issue several times in the CI wheel-tests workflows when using `cupy` and `torch>=2.2` together:
        torch = import_optional("torch")
    /pyenv/versions/3.9.19/lib/python3.9/site-packages/cugraph/utilities/utils.py:455: in import_optional
        return importlib.import_module(mod)
    /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/__init__.py:237: in <module>
        from torch._C import *  # noqa: F403
    E   ImportError: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: ncclCommRegister
The root cause is that `cupy` points to the built-in `libnccl.so` (2.16.2) in the container, while `pytorch` links to `libnccl.so` (2.20.5) from the `nvidia-nccl-cu11` wheel. The older NCCL version is often incompatible with the latest PyTorch releases, which also causes problems for other PyTorch-derived libraries used alongside CuPy. When `cupy` is imported before `torch`, the older NCCL on the system path shadows the version PyTorch needs, resulting in the undefined symbol error shown above.
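To confirm which NCCL library a process has actually picked up, one option is to inspect the process's memory map. The following is only a minimal diagnostic sketch, assuming a Linux container where `/proc/self/maps` is readable; the helper names `nccl_paths_from_maps` and `loaded_nccl_libraries` are hypothetical and not part of CuPy, PyTorch, or cuGraph.

```python
import os


def nccl_paths_from_maps(maps_text):
    """Return the distinct libnccl paths found in /proc/<pid>/maps-style text."""
    paths = []
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, pathname (pathname is optional).
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        if "libnccl.so" in os.path.basename(path) and path not in paths:
            paths.append(path)
    return paths


def loaded_nccl_libraries():
    """List NCCL shared libraries mapped into the current process (Linux only)."""
    maps_file = "/proc/self/maps"
    if not os.path.exists(maps_file):
        return []  # assumption: non-Linux platforms are simply not inspected
    with open(maps_file) as f:
        return nccl_paths_from_maps(f.read())
```

Calling `loaded_nccl_libraries()` right after `import cupy`, and again after `import torch`, should show whether the system copy or the wheel copy was mapped first.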
One less-than-ideal solution is to always import `torch` first, but this approach is rather error-prone for users. @leofang, as a core CuPy dev, we'd like to hear your suggestions on potential workarounds. For example, is there a way to modify environment variables so that CuPy loads NCCL from a non-system path? Thank you!
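As a stopgap while a proper fix is discussed, the import-order workaround could at least be made visible to users. The sketch below is only an illustration, assuming that checking `sys.modules` is a good enough proxy for "CuPy was imported before PyTorch"; the helpers `cupy_loaded_before_torch` and `warn_if_cupy_precedes_torch` are hypothetical names, not existing cuGraph APIs.

```python
import sys
import warnings


def cupy_loaded_before_torch(modules=None):
    """Return True if cupy is already imported but torch is not.

    This only detects the risky state *before* torch is imported; once both
    modules are present, the original import order can no longer be inferred.
    """
    modules = sys.modules if modules is None else modules
    return "cupy" in modules and "torch" not in modules


def warn_if_cupy_precedes_torch():
    """Emit a warning just before importing torch if cupy came first."""
    if cupy_loaded_before_torch():
        warnings.warn(
            "cupy was imported before torch; an older system NCCL may already "
            "be loaded and torch may fail with an undefined symbol error. "
            "Consider importing torch first.",
            RuntimeWarning,
        )
        return True
    return False
```

A wrapper such as `import_optional` could call `warn_if_cupy_precedes_torch()` immediately before importing `torch`, so the failure mode is explained rather than surfacing only as an `ImportError`.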
CC: @alexbarghi-nv @VibhuJawa @naimnv