[BUG] Can't find nccl when building from source

Open KnowingNothing opened this issue 1 year ago • 5 comments

Describe the bug

The build can't find libnccl.so when building from source. It seems flux builds only the static NCCL library, not the shared one, but reduce_scatter requires the shared NCCL library.

To Reproduce

Run ./build.sh --arch 80

Expected behavior

The build should link successfully. Instead, the link step fails: cannot find -lnccl
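One rough way to confirm this symptom outside the build is to ask Python whether a shared NCCL library is discoverable at all. The sketch below is not part of flux, and the helper name `shared_nccl_visible` is made up for illustration. `ctypes.util.find_library` follows the platform's own lookup (on Linux it consults `ldconfig` and, if available, the compiler and linker), so it only approximates what the flux link step sees.

```python
from ctypes.util import find_library
from typing import Optional


def shared_nccl_visible() -> Optional[str]:
    """Return the name of a discoverable shared NCCL library, or None.

    Hypothetical helper, not flux code. A None result is consistent with
    the "cannot find -lnccl" link error, but it does not prove the cause.
    """
    return find_library("nccl")


if __name__ == "__main__":
    name = shared_nccl_visible()
    if name is None:
        print("no shared NCCL library found on the default search paths")
    else:
        print(f"found shared NCCL library: {name}")
```

If this reports nothing while a libnccl_static.a exists in the bundled build directory, that would match the description above: only the static library is present.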

Environment

Linux hina 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:14:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0

8x A100 80GB PCIe

KnowingNothing • Aug 03 '24 06:08

I tried to fix this with the following changes:

Add

find_library(NCCL_LIB
             NAMES nccl_static
             PATHS ${PROJECT_SOURCE_DIR}/3rdparty/nccl/build/lib
             NO_DEFAULT_PATH)
if (NCCL_LIB)
  message(STATUS "Found nccl static lib in " ${PROJECT_SOURCE_DIR}/3rdparty/nccl/build/lib)
else()
  message(STATUS "Can't find nccl static lib in " ${PROJECT_SOURCE_DIR}/3rdparty/nccl/build/lib)
endif()
target_include_directories(${LIB_NAME} PRIVATE ${PROJECT_SOURCE_DIR}/3rdparty/nccl/build/include)
target_link_libraries(${LIB_NAME} PUBLIC ${NCCL_LIB})

to flux/src/reduce_scatter/CMakeLists.txt (after line 17).

Then, in flux/setup.py (line 128), change

include_dirs = [root_path / "include", root_path / "src"]

to

include_dirs = [root_path / "include", root_path / "src", root_path / "3rdparty/nccl/build/include"]
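A slightly more defensive variant of the setup.py change could add the bundled NCCL include directory only when it actually exists. This is a minimal sketch under assumptions: `root_path` mirrors the variable quoted from setup.py, while the function `build_include_dirs` is hypothetical and does not appear in flux.

```python
from pathlib import Path
from typing import List


def build_include_dirs(root_path: Path) -> List[Path]:
    """Return include dirs, adding the bundled NCCL headers if they are built.

    Hypothetical illustration of the setup.py change; not flux code.
    """
    include_dirs = [root_path / "include", root_path / "src"]
    nccl_include = root_path / "3rdparty/nccl/build/include"
    # Only add the directory once the bundled NCCL build has produced it.
    if nccl_include.is_dir():
        include_dirs.append(nccl_include)
    return include_dirs
```

With this shape, a checkout where the bundled NCCL has not been built yet keeps the original two include directories instead of pointing at a missing path.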

KnowingNothing • Aug 03 '24 06:08

cc @zheng-ningxin

wenlei-bao • Aug 08 '24 16:08

@KnowingNothing Does this still apply?

wenlei-bao • Aug 29 '24 17:08

@KnowingNothing Does this still apply?

wenlei-bao • Sep 10 '24 17:09

I ran into the same issue and solved it with conda install -c nvidia nccl

I also checked whether the NCCL path is included. It should be somewhere under /usr/local, but sometimes it is not there.

For me it is in either a venv environment or /usr/lib/
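To see where NCCL actually lives on a given machine, a small script can list libnccl* files in the locations mentioned in this thread. This is only a sketch: the function name `find_nccl_libs` and the candidate list are assumptions for illustration, and the search is not recursive, so subdirectories such as a distribution's multiarch lib folder would need to be added by hand.

```python
import sys
from pathlib import Path
from typing import Iterable, List


def find_nccl_libs(dirs: Iterable[Path]) -> List[Path]:
    """List libnccl* files (shared or static) found directly in `dirs`."""
    hits: List[Path] = []
    for d in dirs:
        if d.is_dir():
            hits.extend(sorted(d.glob("libnccl*")))
    return hits


if __name__ == "__main__":
    # Candidate locations taken from this thread; adjust for your machine.
    candidates = [
        Path("/usr/local/lib"),
        Path("/usr/lib"),
        Path(sys.prefix) / "lib",         # active venv or conda environment
        Path("3rdparty/nccl/build/lib"),  # bundled build, relative to the flux checkout
    ]
    for lib in find_nccl_libs(candidates):
        kind = "static" if lib.suffix == ".a" else "shared/other"
        print(f"{kind}: {lib}")
```

Run it from the root of the flux checkout so the relative 3rdparty path resolves.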

Zhuohao-Li • Oct 26 '24 22:10