[BUG] Can't find nccl when building from source
Describe the bug
The linker can't find libnccl.so when building from source. It seems flux only builds the static NCCL library (nccl_static) and not the shared one, but the reduce_scatter target links against the shared NCCL library.
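To confirm which kind of NCCL library the bundled build actually produced, something like the following sketch may help. The helper name classify_nccl_libs is made up for illustration, and the directory 3rdparty/nccl/build/lib is the one mentioned later in this issue; adjust it if your checkout differs.

```python
from pathlib import Path


def classify_nccl_libs(lib_dir):
    """Split NCCL library files in lib_dir into static archives and shared objects.

    Hypothetical helper, not part of flux. It only looks at file names.
    """
    lib_dir = Path(lib_dir)
    found = {"static": [], "shared": []}
    if not lib_dir.is_dir():
        # Nothing was built (or the path is wrong): report empty lists.
        return found
    for path in sorted(lib_dir.iterdir()):
        name = path.name
        if not name.startswith("libnccl"):
            continue
        if name.endswith(".a"):
            found["static"].append(name)
        elif ".so" in name:
            # Matches libnccl.so as well as versioned names such as libnccl.so.2
            found["shared"].append(name)
    return found


if __name__ == "__main__":
    # Path taken from this issue; it is an assumption about your layout.
    print(classify_nccl_libs("3rdparty/nccl/build/lib"))
```

If the "shared" list comes back empty while "static" is not, that matches the behavior described above.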
To Reproduce
Run ./build.sh --arch 80
Expected behavior
The build should link successfully. Instead, the link step fails: cannot find -lnccl
Stack trace/logs
Environment
Linux hina 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:14:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
8x A100 80GB PCIe
Proposed fix
Additional context
I tried to fix this with two changes.
First, add the following to flux/src/reduce_scatter/CMakeLists.txt (after line 17):
    # Look only in the bundled NCCL build output for the static library.
    find_library(NCCL_LIB
      NAMES nccl_static
      PATHS ${PROJECT_SOURCE_DIR}/3rdparty/nccl/build/lib
      NO_DEFAULT_PATH)
    if (NCCL_LIB)
      message(STATUS "Found nccl static lib: ${NCCL_LIB}")
    else()
      # Linking against NCCL_LIB-NOTFOUND would fail at generate time anyway, so stop here with a clearer message.
      message(FATAL_ERROR "Can't find nccl static lib in ${PROJECT_SOURCE_DIR}/3rdparty/nccl/build/lib")
    endif()
    target_include_directories(${LIB_NAME} PRIVATE ${PROJECT_SOURCE_DIR}/3rdparty/nccl/build/include)
    target_link_libraries(${LIB_NAME} PUBLIC ${NCCL_LIB})
Second, in flux/setup.py (line 128), change
    include_dirs = [root_path / "include", root_path / "src"]
to
    include_dirs = [root_path / "include", root_path / "src", root_path / "3rdparty/nccl/build/include"]
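A slightly more defensive variant of the setup.py change could add the NCCL include directory only when it exists, so a checkout without the bundled NCCL build is not pointed at a missing path. This is only a sketch: build_include_dirs is a hypothetical helper, not something in flux's setup.py, and root_path is assumed to be the repository root as in the original line.

```python
from pathlib import Path


def build_include_dirs(root_path):
    """Return include dirs for the extension, adding NCCL headers only if present.

    Hypothetical helper for illustration; flux's setup.py builds this list inline.
    """
    root_path = Path(root_path)
    include_dirs = [root_path / "include", root_path / "src"]
    # Location used in this issue for the bundled NCCL build output.
    nccl_include = root_path / "3rdparty/nccl/build/include"
    if nccl_include.is_dir():
        include_dirs.append(nccl_include)
    return include_dirs
```

In setup.py this would replace the hard-coded list, for example include_dirs = build_include_dirs(root_path).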
cc @zheng-ningxin
@KnowingNothing Does this still apply, or not?
@KnowingNothing does this still apply?
I ran into the same issue and solved it with conda install -c nvidia nccl
I also checked whether the NCCL path is included in the search path. It should be somewhere under /usr/local, but sometimes it is not there.
For me it is in either a venv environment or /usr/lib/
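Since the shared library can end up in different places, a small script can list where libnccl.so actually lives on a given machine. The sketch below is an assumption-laden helper, not part of flux or NCCL: find_nccl_shared_libs is a made-up name, and the roots to search are whatever you pass in.

```python
import fnmatch
import os


def find_nccl_shared_libs(roots):
    """Return sorted paths of files named libnccl.so* found under the given roots.

    Hypothetical helper. Roots that do not exist or cannot be read are skipped,
    because os.walk ignores such errors by default.
    """
    hits = []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in fnmatch.filter(filenames, "libnccl.so*"):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)
```

For the locations mentioned in this thread, a call such as find_nccl_shared_libs(["/usr/local", "/usr/lib", sys.prefix]) (with import sys) would cover /usr/local, /usr/lib, and the active venv or conda environment. If nothing is found, the linker will most likely not find -lnccl either.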