Failing to build with DS_BUILD_OPS=1 due to missing nccl.h file

Open Sirorezka opened this issue 4 months ago • 3 comments

Hi, I'm having trouble installing DeepSpeed with additional build flags. When I run

export NCCL_HOME=$CONDA_PREFIX/lib/python3.11/site-packages/nvidia/nccl
DS_BUILD_OPS=1 DS_BUILD_TRANSFORMER_INFERENCE=1 pip install --force-reinstall deepspeed --no-cache  --no-deps

I get the following error:

 building 'deepspeed.ops.dc_op' extension
      creating build/temp.linux-x86_64-cpython-311/csrc/compile
      /home/conda/envs/petrov_dpo/bin/x86_64-conda-linux-gnu-c++ -fvisibility-inlines-hidden -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/conda/envs/petrov_dpo/include -I/home/conda/envs/petrov_dpo/targets/x86_64-linux/include -L/home/conda/envs/petrov_dpo/targets/x86_64-linux/lib -L/home/conda/envs/petrov_dpo/targets/x86_64-linux/lib/stubs -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/conda/envs/petrov_dpo/include -I/home/conda/envs/petrov_dpo/targets/x86_64-linux/include -L/home/conda/envs/petrov_dpo/targets/x86_64-linux/lib -L/home/conda/envs/petrov_dpo/targets/x86_64-linux/lib/stubs -fPIC -I/tmp/pip-install-xj439mbp/deepspeed_e6e2eefe53f349b98db99e60376d7866/csrc/includes -I/tmp/pip-install-xj439mbp/deepspeed_e6e2eefe53f349b98db99e60376d7866/csrc/compile -I/home/conda/envs/petrov_dpo/include -I/home/conda/envs/petrov_dpo/lib/python3.11/site-packages/torch/include -I/home/conda/envs/petrov_dpo/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/home/conda/envs/petrov_dpo/include -I/home/conda/envs/petrov_dpo/include/python3.11 -c csrc/compile/deepcompile.cpp -o build/temp.linux-x86_64-cpython-311/csrc/compile/deepcompile.o -O3 -std=c++17 -g -Wno-reorder -L/home/conda/envs/petrov_dpo/lib -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -DTORCH_EXTENSION_NAME=dc_op -D_GLIBCXX_USE_CXX11_ABI=1
      In file included from /tmp/pip-install-xj439mbp/deepspeed_e6e2eefe53f349b98db99e60376d7866/csrc/includes/deepcompile.h:20,
                       from csrc/compile/deepcompile.cpp:6:
      /home/conda/envs/petrov_dpo/lib/python3.11/site-packages/torch/include/torch/csrc/distributed/c10d/NCCLUtils.hpp:15:10: fatal error: nccl.h: No such file or directory
         15 | #include <nccl.h>
            |          ^~~~~~~~
      compilation terminated.
      error: command '/home/conda/envs/petrov_dpo/bin/x86_64-conda-linux-gnu-c++' failed with exit code 1      

Is there a way to solve this? I'm using cuDNN installed via conda, so CUDA, nvcc, and other libraries are in non-standard folders.

Sirorezka avatar Aug 06 '25 13:08 Sirorezka

Same issue here. Maybe we need a way to specify an additional NCCL path instead of using the CUDA path directly. https://github.com/deepspeedai/DeepSpeed/blob/047a7599d24622dfb37fa5e5a32c671b1bb44233/op_builder/dc.py#L40

For example, the builder could check an NCCL_INCLUDE_PATH environment variable.

npuichigo avatar Aug 28 '25 06:08 npuichigo

Thanks for the answer. Sorry, I have moved on from this topic, so I'm not sure I will have time to check your solution, but maybe this question will help someone else. Many thanks.

Sirorezka avatar Sep 09 '25 14:09 Sirorezka

FYI, I made the following patch and applied it to 0.17.6:

git apply <<'PATCH'
diff --git a/op_builder/dc.py b/op_builder/dc.py
index 15b25bf3..bce4e97d 100644
--- a/op_builder/dc.py
+++ b/op_builder/dc.py
@@ -33,6 +33,10 @@ class DeepCompileBuilder(TorchCPUOpBuilder):
             CUDA_INCLUDE = []
         elif not self.is_rocm_pytorch():
             CUDA_INCLUDE = [os.path.join(torch.utils.cpp_extension.CUDA_HOME, "include")]
+            # If set, append a single NCCL include dir.
+            _nccl_inc = os.environ.get("NCCL_INCLUDE_DIR")
+            if _nccl_inc and _nccl_inc not in CUDA_INCLUDE:
+                CUDA_INCLUDE.append(_nccl_inc)
         else:
             CUDA_INCLUDE = [
                 os.path.join(torch.utils.cpp_extension.ROCM_HOME, "include"),
PATCH

Setting and exporting NCCL_INCLUDE_DIR beforehand, of course.
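
If it helps, here is a minimal sketch for finding the directory to put in NCCL_INCLUDE_DIR. It assumes NCCL came from the pip wheel and that the header sits at site-packages/nvidia/nccl/include/nccl.h, which may not match every environment. The helper name find_nccl_include is hypothetical and is not part of DeepSpeed.

```python
import os
import sysconfig
from pathlib import Path
from typing import Iterable, Optional


def find_nccl_include(roots: Iterable[str]) -> Optional[str]:
    """Return the first directory under any root that contains nccl.h, or None."""
    for root in roots:
        if not root:
            # Skip unset/empty entries so we never scan the current directory.
            continue
        root_path = Path(root)
        if not root_path.is_dir():
            continue
        # Cheap check first: the conventional "<root>/include/nccl.h" layout.
        direct = root_path / "include" / "nccl.h"
        if direct.is_file():
            return str(direct.parent)
        # Otherwise search the root recursively for the header.
        for header in sorted(root_path.rglob("nccl.h")):
            if header.is_file():
                return str(header.parent)
    return None


if __name__ == "__main__":
    # Assumed candidate roots: NCCL_HOME (as in the original report) and the
    # pip wheel location under this interpreter's site-packages.
    site_packages = sysconfig.get_paths()["purelib"]
    candidates = [
        os.environ.get("NCCL_HOME", ""),
        os.path.join(site_packages, "nvidia", "nccl"),
    ]
    include_dir = find_nccl_include(candidates)
    if include_dir is None:
        print("nccl.h not found in the candidate roots")
    else:
        print(include_dir)
```

Run it with the Python of the target conda env, export the printed path as NCCL_INCLUDE_DIR, and then rerun the pip install with the patched op_builder/dc.py.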

felker avatar Oct 01 '25 03:10 felker