
`cupy` wheel and `torch` wheel link to different NCCL shared libraries in RAPIDS CI containers

Open tingyu66 opened this issue 8 months ago • 5 comments

We have hit this issue several times in CI wheel-tests workflows when using cupy and torch>=2.2 together:

    torch = import_optional("torch")
/pyenv/versions/3.9.19/lib/python3.9/site-packages/cugraph/utilities/utils.py:455: in import_optional
    return importlib.import_module(mod)
/pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/__init__.py:237: in <module>
    from torch._C import *  # noqa: F403
E   ImportError: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: ncclCommRegister

The root cause is that cupy loads the built-in libnccl.so (2.16.2) from the container, while PyTorch links against libnccl.so (2.20.5) from the nvidia-nccl-cu11 wheel. The older NCCL version is often incompatible with recent PyTorch releases, which also causes problems when coupling with other PyTorch-derived libraries. When cupy is imported before torch, the older NCCL on the system path shadows the version PyTorch needs, resulting in the undefined symbol error shown above.
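To confirm which NCCL library a process has actually picked up, one option on Linux is to scan `/proc/self/maps` after the imports. The sketch below is only a diagnostic aid; the helper names `nccl_paths_from_maps` and `loaded_nccl_libraries` are made up for this example and are not part of cupy, torch, or cugraph.

```python
import os
import re


def nccl_paths_from_maps(maps_text):
    """Return sorted, de-duplicated paths of libnccl shared objects
    found in the text of a /proc/<pid>/maps file."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6 and re.search(r"libnccl\.so", os.path.basename(fields[5])):
            paths.add(fields[5].strip())
    return sorted(paths)


def loaded_nccl_libraries():
    """Return the libnccl paths mapped into the current process.
    Returns an empty list where /proc/self/maps is unavailable."""
    try:
        with open("/proc/self/maps") as f:
            return nccl_paths_from_maps(f.read())
    except OSError:
        return []


if __name__ == "__main__":
    # Import cupy and/or torch before this point, in the order under test.
    for path in loaded_nccl_libraries():
        print(path)
```

Running this once with `import cupy` first and once with `import torch` first should show whether the system library or the wheel's library is the one mapped into the process.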

One less-than-ideal workaround is to always import torch first, but this is error-prone for users. @leofang, as a core cupy dev, we'd like to hear your suggestions on potential workarounds. For example, is there a way to modify environment variables so that CuPy loads NCCL from a non-system path? Thank you!
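For discussion, here is an untested sketch of one possible alternative to fixing the import order: preload the wheel's NCCL with `ctypes` before either library is imported. It assumes the nvidia-nccl-cu11 wheel places the library under `<site-packages>/nvidia/nccl/lib/`, and that the dynamic loader reuses an already-loaded library with the same SONAME when cupy later asks for NCCL. Both are assumptions that would need to be checked in the CI container, and `find_wheel_nccl` / `preload_wheel_nccl` are hypothetical helper names.

```python
import ctypes
import glob
import os
import sys


def find_wheel_nccl(search_paths=None):
    """Return the first libnccl.so* found under <path>/nvidia/nccl/lib
    (assumed wheel layout), or None if nothing matches."""
    for base in (sys.path if search_paths is None else search_paths):
        pattern = os.path.join(base, "nvidia", "nccl", "lib", "libnccl.so*")
        matches = sorted(glob.glob(pattern))
        if matches:
            return matches[0]
    return None


def preload_wheel_nccl(search_paths=None):
    """Load the wheel's NCCL with RTLD_GLOBAL so that later loads of the
    same SONAME may resolve to it. Returns the loaded path, or None."""
    path = find_wheel_nccl(search_paths)
    if path is not None:
        ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
    return path
```

The intended usage would be to call `preload_wheel_nccl()` at the very top of the test session (for example in a `conftest.py`), before any `import cupy` or `import torch`. Whether cupy then picks up the preloaded library depends on how it locates NCCL, which this sketch does not verify.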

CC: @alexbarghi-nv @VibhuJawa @naimnv

tingyu66, Jun 05 '24 21:06