
`cupy` wheel and `torch` wheel link to different NCCL shared libraries in RAPIDS CI containers

Open tingyu66 opened this issue 1 year ago • 5 comments

We have hit the following failure several times in CI wheel-test workflows when using cupy and torch>=2.2 together:

    torch = import_optional("torch")
/pyenv/versions/3.9.19/lib/python3.9/site-packages/cugraph/utilities/utils.py:455: in import_optional
    return importlib.import_module(mod)
/pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/__init__.py:237: in <module>
    from torch._C import *  # noqa: F403
E   ImportError: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: ncclCommRegister

The root cause is that cupy points to the builtin libnccl.so (2.16.2) in the container, while pytorch links to libnccl.so (2.20.5) from the nvidia-nccl-cu11 wheel. The older NCCL version is often incompatible with latest pytorch releases, which causes problems when coupling with other PyTorch-derived libraries. When cupy is imported before torch, the older nccl in the system path shadows the version needed by pytorch, resulting in the undefined symbol error mentioned above.
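A quick way to confirm which copies of NCCL "won" the import-order race is to inspect the process's memory map after the imports. A minimal diagnostic sketch (assumes Linux, where /proc/self/maps is available; the helper name is ours, not part of cupy or torch):

```python
# Diagnostic sketch (Linux-only): list the shared objects mapped into this
# process whose path contains a given substring. Running it after
# `import cupy; import torch` shows which libnccl copies are actually resident.
def loaded_libs(substring):
    """Return sorted paths of mapped shared objects matching `substring`."""
    paths = set()
    with open("/proc/self/maps") as maps:
        for line in maps:
            parts = line.split()
            # The mapped file path, when present, is the last whitespace field.
            if len(parts) >= 6 and substring in parts[-1]:
                paths.add(parts[-1])
    return sorted(paths)

# After the problematic import order, more than one entry here means two
# distinct NCCL builds are loaded into the same process:
print(loaded_libs("libnccl"))
```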

One less-than-ideal solution is to always import torch first, but this approach is rather error-prone for users. We'd like to hear your suggestions @leofang as a core cupy dev on potential workarounds. For example, is there a way to modify environment variables so that CuPy loads NCCL from a non-system path? Thank you!

CC: @alexbarghi-nv @VibhuJawa @naimnv

tingyu66 avatar Jun 05 '24 21:06 tingyu66

To reproduce:

    docker run --gpus all --rm -it --network=host rapidsai/citestwheel:cuda11.8.0-ubuntu20.04-py3.9 bash
    pip install cupy-cuda11x
    pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cu118
    python -c "import cupy; import torch"

tingyu66 avatar Jun 05 '24 21:06 tingyu66

I feel something is wrong in your container in a way that I haven't fully understood. CuPy lazy-loads all CUDA libraries so import cupy does not trigger the loading of libnccl. Something else does (but I can't tell why).

The only CuPy module that links to libnccl is cupy_backends.cuda.libs.nccl, as can be confirmed as follows:

root@marie:/# for f in $(find / -type f,l -regex '/pyenv/**/.*.so'); do readelf -d $f | grep "nccl.so"; if [[ $? -eq 0 ]]; then echo $f; fi; done
 0x0000000000000001 (NEEDED)             Shared library: [libnccl.so.2]
/pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so
 0x0000000000000001 (NEEDED)             Shared library: [libnccl.so.2]
/pyenv/versions/3.9.19/lib/python3.9/site-packages/cupy_backends/cuda/libs/nccl.cpython-39-x86_64-linux-gnu.so

Now, if you monitor the loaded DSOs, you'll see that this module (nccl.cpython-39-x86_64-linux-gnu.so) is in fact not loaded (by design), yet libnccl still gets loaded:

root@marie:/# LD_DEBUG=libs python -c "import cupy" 2>&1 | grep nccl
      6505:	find library=libnccl.so.2.16.2 [0]; searching
      6505:	  trying file=/pyenv/versions/3.9.19/lib/libnccl.so.2.16.2
      6505:	  trying file=/lib/x86_64-linux-gnu/libnccl.so.2.16.2
      6505:	calling init: /lib/x86_64-linux-gnu/libnccl.so.2.16.2
      6505:	calling fini: /lib/x86_64-linux-gnu/libnccl.so.2.16.2 [0]

What's worse, when the import order is swapped two distinct copies of libnccl are loaded, one from the system (as shown above) and the other from the nccl wheel:

root@marie:/# LD_DEBUG=libs python -c "import torch; import cupy" 2>&1 | grep "calling init:.*nccl"
      6787:	calling init: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2
      6787:	calling init: /lib/x86_64-linux-gnu/libnccl.so.2.16.2

So I am not sure what I'm looking at 🤷

Question for @tingyu66: If this container is owned/controlled by RAPIDS, can't you just remove the system NCCL?

leofang avatar Jun 06 '24 00:06 leofang

@leofang I just took another look and think I found the culprit. https://github.com/cupy/cupy/blob/a54b7abfed668e52de7f3eee7b3fe8ccaef34874/cupy/_environment.py#L270-L274

For wheel builds, cupy._environment preloads the specific CUDA library versions listed in the .data/_wheel.json file:

root@1cc5aab-lcedt:~# cat /pyenv/versions/3.9.19/lib/python3.9/site-packages/cupy/.data/_wheel.json

{"cuda": "11.x", "packaging": "pip",
 "cutensor": {"version": "2.0.1", "filenames": ["libcutensor.so.2.0.1"]},
 "nccl": {"version": "2.16.2", "filenames": ["libnccl.so.2.16.2"]},
 "cudnn": {"version": "8.8.1", "filenames": ["libcudnn.so.8.8.1", "libcudnn_ops_infer.so.8.8.1", "libcudnn_ops_train.so.8.8.1", "libcudnn_cnn_infer.so.8.8.1", "libcudnn_cnn_train.so.8.8.1", "libcudnn_adv_infer.so.8.8.1", "libcudnn_adv_train.so.8.8.1"]}}

That explains why the runtime linker was looking for the exact version 2.16.2 during import and would not accept any other libnccl, even with RPATH or LD_LIBRARY_PATH tweaked:

root@marie:/# LD_DEBUG=libs python -c "import cupy" 2>&1 | grep nccl
      6505:	find library=libnccl.so.2.16.2 [0]; searching
      6505:	  trying file=/pyenv/versions/3.9.19/lib/libnccl.so.2.16.2
      6505:	  trying file=/lib/x86_64-linux-gnu/libnccl.so.2.16.2
      6505:	calling init: /lib/x86_64-linux-gnu/libnccl.so.2.16.2
      6505:	calling fini: /lib/x86_64-linux-gnu/libnccl.so.2.16.2 [0]
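In other words, the preload config pins an exact filename, and a dlopen() of an exact name is all-or-nothing. A small illustration (not CuPy's actual preload code) of why the wheel's generic libnccl.so.2 cannot satisfy a request for libnccl.so.2.16.2:

```python
# Illustration only (not CuPy's real preload logic): the _wheel.json entry
# pins an exact filename, so the eventual dlopen() searches for that exact
# name; a generic libnccl.so.2 elsewhere on the path cannot satisfy it.
import json

wheel_json = json.loads(
    '{"nccl": {"version": "2.16.2", "filenames": ["libnccl.so.2.16.2"]}}'
)
requested = wheel_json["nccl"]["filenames"][0]
available = "libnccl.so.2"  # what the nvidia-nccl wheel actually ships

# The dynamic loader matches by name, not by "same library, newer version":
print(requested == available)  # → False
```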

After changing "nccl": {"version": "2.16.2", "filenames": ["libnccl.so.2.16.2"]} to {"version": "2.20.5", "filenames": ["libnccl.so.2"]} in _wheel.json to match PyTorch's requirement, and updating LD_LIBRARY_PATH:

root@1cc5aab-lcedt:~# LD_DEBUG=libs python -c "import cupy; import torch" 2>&1 | grep "calling init:.*nccl"
      4041:	calling init: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2
root@1cc5aab-lcedt:~# LD_DEBUG=libs python -c "import torch; import cupy" 2>&1 | grep "calling init:.*nccl"
      4182:	calling init: /pyenv/versions/3.9.19/lib/python3.9/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2

we finally have only one NCCL loaded. :upside_down_face:
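The manual patch above can be scripted. A sketch (the _wheel.json path and the target version string are assumptions taken from this thread, not a supported CuPy interface; upgrading CuPy once a fix lands is the proper solution):

```python
# Sketch of the manual workaround described above. The _wheel.json location
# and NCCL version are assumptions from this thread, not a stable CuPy API.
import json

def patch_nccl_entry(wheel_json_path, version="2.20.5"):
    """Rewrite cupy's preload config to request the generic NCCL soname."""
    with open(wheel_json_path) as f:
        cfg = json.load(f)
    cfg["nccl"] = {"version": version, "filenames": ["libnccl.so.2"]}
    with open(wheel_json_path, "w") as f:
        json.dump(cfg, f)
    return cfg

# Usage (path is the example from this container, adjust to your environment):
# patch_nccl_entry(
#     "/pyenv/versions/3.9.19/lib/python3.9/site-packages/cupy/.data/_wheel.json"
# )
```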

tingyu66 avatar Jun 06 '24 02:06 tingyu66

Ah, good find, I forgot there's the preload logic...

Could you file a bug in CuPy's issue tracker? I think the preload logic needs to happen as part of the lazy loading, not before it.
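The suggested change could look roughly like the following lazy-dlopen pattern (a generic sketch, not CuPy's implementation; libm.so.6 stands in for libnccl so the example runs on any glibc system):

```python
# Generic lazy-dlopen sketch (not CuPy's code): the library is not opened at
# import/construction time, only on first symbol access, so a bad preload
# config cannot break `import` itself.
import ctypes

class LazyLib:
    def __init__(self, soname):
        self._soname = soname
        self._handle = None

    def __getattr__(self, name):
        # dlopen() happens here, on first use, not at construction time.
        if self._handle is None:
            self._handle = ctypes.CDLL(self._soname)
        return getattr(self._handle, name)

libm = LazyLib("libm.so.6")  # nothing is loaded yet
libm.cbrt.restype = ctypes.c_double
libm.cbrt.argtypes = [ctypes.c_double]
print(libm.cbrt(27.0))  # → 3.0
```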

Now you have two ways to hack in your CI workflow :D

leofang avatar Jun 06 '24 02:06 leofang

Update: The fix from CuPy is expected to be released in version 13.2.0 sometime this week.

tingyu66 avatar Jun 10 '24 19:06 tingyu66

This was resolved with the cupy upgrade.

ChuckHastings avatar Feb 03 '25 21:02 ChuckHastings