Failing to build with DS_BUILD_OPS=1 due to missing nccl.h file
Hi, I'm having trouble installing DeepSpeed with additional build flags. When I run
export NCCL_HOME=$CONDA_PREFIX/lib/python3.11/site-packages/nvidia/nccl
DS_BUILD_OPS=1 DS_BUILD_TRANSFORMER_INFERENCE=1 pip install --force-reinstall deepspeed --no-cache --no-deps
I get the following error:
building 'deepspeed.ops.dc_op' extension
creating build/temp.linux-x86_64-cpython-311/csrc/compile
/home/conda/envs/petrov_dpo/bin/x86_64-conda-linux-gnu-c++ -fvisibility-inlines-hidden -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/conda/envs/petrov_dpo/include -I/home/conda/envs/petrov_dpo/targets/x86_64-linux/include -L/home/conda/envs/petrov_dpo/targets/x86_64-linux/lib -L/home/conda/envs/petrov_dpo/targets/x86_64-linux/lib/stubs -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/conda/envs/petrov_dpo/include -I/home/conda/envs/petrov_dpo/targets/x86_64-linux/include -L/home/conda/envs/petrov_dpo/targets/x86_64-linux/lib -L/home/conda/envs/petrov_dpo/targets/x86_64-linux/lib/stubs -fPIC -I/tmp/pip-install-xj439mbp/deepspeed_e6e2eefe53f349b98db99e60376d7866/csrc/includes -I/tmp/pip-install-xj439mbp/deepspeed_e6e2eefe53f349b98db99e60376d7866/csrc/compile -I/home/conda/envs/petrov_dpo/include -I/home/conda/envs/petrov_dpo/lib/python3.11/site-packages/torch/include -I/home/conda/envs/petrov_dpo/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/home/conda/envs/petrov_dpo/include -I/home/conda/envs/petrov_dpo/include/python3.11 -c csrc/compile/deepcompile.cpp -o build/temp.linux-x86_64-cpython-311/csrc/compile/deepcompile.o -O3 -std=c++17 -g -Wno-reorder -L/home/conda/envs/petrov_dpo/lib -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -DTORCH_EXTENSION_NAME=dc_op -D_GLIBCXX_USE_CXX11_ABI=1
In file included from /tmp/pip-install-xj439mbp/deepspeed_e6e2eefe53f349b98db99e60376d7866/csrc/includes/deepcompile.h:20,
from csrc/compile/deepcompile.cpp:6:
/home/conda/envs/petrov_dpo/lib/python3.11/site-packages/torch/include/torch/csrc/distributed/c10d/NCCLUtils.hpp:15:10: fatal error: nccl.h: No such file or directory
15 | #include <nccl.h>
| ^~~~~~~~
compilation terminated.
error: command '/home/conda/envs/petrov_dpo/bin/x86_64-conda-linux-gnu-c++' failed with exit code 1
Is there a way to solve this? I'm using cuDNN installed via conda, so CUDA, nvcc, and the other libraries live in non-standard directories.
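To confirm where the header actually lives in a setup like this, a quick search of the environment can help. The sketch below is only an illustration: `find_header` is a hypothetical helper name, not part of DeepSpeed or PyTorch. It walks the given root directories and returns every directory that contains a file with the requested name.

```python
import os


def find_header(name, roots):
    """Return the sorted directories under `roots` that contain a file called `name`."""
    hits = set()
    for root in roots:
        # os.walk silently yields nothing for a root that does not exist.
        for dirpath, _dirnames, filenames in os.walk(root):
            if name in filenames:
                hits.add(dirpath)
    return sorted(hits)
```

Calling `find_header("nccl.h", [os.environ["CONDA_PREFIX"]])` inside the activated environment should list the candidate include directories. With NCCL_HOME set as above, I would expect one of them to be `$NCCL_HOME/include`, but that is an assumption to verify on your machine.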
Same issue here. We may need a way to specify an additional NCCL include path instead of using only the CUDA path directly. https://github.com/deepspeedai/DeepSpeed/blob/047a7599d24622dfb37fa5e5a32c671b1bb44233/op_builder/dc.py#L40
For example, the builder could check an NCCL_INCLUDE_PATH environment variable.
Thanks for the answer. Sorry, I have since moved on to a different topic, so I'm not sure I'll have time to test your solution, but maybe this question will help someone else. Many thanks.
FYI, I made the following patch and applied it to 0.17.6:
git apply <<'PATCH'
diff --git a/op_builder/dc.py b/op_builder/dc.py
index 15b25bf3..bce4e97d 100644
--- a/op_builder/dc.py
+++ b/op_builder/dc.py
@@ -33,6 +33,10 @@ class DeepCompileBuilder(TorchCPUOpBuilder):
             CUDA_INCLUDE = []
         elif not self.is_rocm_pytorch():
             CUDA_INCLUDE = [os.path.join(torch.utils.cpp_extension.CUDA_HOME, "include")]
+            # If set, append a single NCCL include dir.
+            _nccl_inc = os.environ.get("NCCL_INCLUDE_DIR")
+            if _nccl_inc and _nccl_inc not in CUDA_INCLUDE:
+                CUDA_INCLUDE.append(_nccl_inc)
         else:
             CUDA_INCLUDE = [
                 os.path.join(torch.utils.cpp_extension.ROCM_HOME, "include"),
PATCH
Set and export NCCL_INCLUDE_DIR beforehand, of course.
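For the conda/pip layout from the original report, preparing the environment could look roughly like the sketch below. It assumes `nccl.h` sits in `$NCCL_HOME/include` (not verified in this thread) and that the patch above is applied, since the patch is what reads `NCCL_INCLUDE_DIR`. `build_env` is a hypothetical helper name used only for illustration.

```python
import os


def build_env(nccl_home, base_env=None):
    """Return a copy of the environment with NCCL_INCLUDE_DIR and the DS_BUILD_* flags set."""
    env = dict(os.environ if base_env is None else base_env)
    include_dir = os.path.join(nccl_home, "include")  # assumed location of nccl.h
    if not os.path.isfile(os.path.join(include_dir, "nccl.h")):
        # Fail early instead of waiting for the compiler to report the missing header.
        raise FileNotFoundError(f"nccl.h not found in {include_dir}")
    env["NCCL_INCLUDE_DIR"] = include_dir  # read by the patched op_builder/dc.py
    env["DS_BUILD_OPS"] = "1"
    env["DS_BUILD_TRANSFORMER_INFERENCE"] = "1"
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run` when invoking `pip install` on the patched source tree. Checking for the header first turns a long compile log into a one-line error if the path is wrong.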