DeepSpeed Compilation error for 0.8.1 with CUDA 11.2

The conda-forge bot is building the new deepspeed packages for 0.8.1. See https://github.com/conda-forge/deepspeed-feedstock/pull/6#issuecomment-1436120690 for context.

All the builds for CUDA 11.2 are failing with the error below:

  /home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_build_env/bin/x86_64-conda-linux-gnu-c++ -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib -Wl,-rpath-link,/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib -L/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib -Wl,-rpath-link,/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib -L/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined 
-Wl,-rpath,/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib -Wl,-rpath-link,/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib -L/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/work=/usr/local/src/conda/deepspeed-0.8.1 -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p=/usr/local/src/conda-prefix -isystem /usr/local/cuda/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include -isystem 
/usr/local/cuda/include build/temp.linux-x86_64-cpython-311/csrc/lamb/fused_lamb_cuda.o build/temp.linux-x86_64-cpython-311/csrc/lamb/fused_lamb_cuda_kernel.o -L/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.11/site-packages/torch/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-cpython-311/deepspeed/ops/lamb/fused_lamb_op.cpython-311-x86_64-linux-gnu.so
  building 'deepspeed.ops.quantizer.quantizer_op' extension
  creating build/temp.linux-x86_64-cpython-311/csrc/quantization
  /usr/local/cuda/bin/nvcc -Icsrc/includes -I/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.11/site-packages/torch/include -I/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.11/site-packages/torch/include/TH -I/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.11/site-packages/torch/include/THC -I/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include/python3.11 -c csrc/quantization/dequantize.cu -o build/temp.linux-x86_64-cpython-311/csrc/quantization/dequantize.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1014\" 
-DTORCH_EXTENSION_NAME=quantizer_op -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -ccbin /home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_build_env/bin/x86_64-conda-linux-gnu-cc -std=c++14
  csrc/includes/reduction_utils.h(170): error: no operator "+" matches these operands
              operand types are: const __half + const __half

  csrc/includes/reduction_utils.h(199): error: no operator "+" matches these operands
              operand types are: const __half2 + const __half2

  csrc/includes/dequantization_utils.h(165): warning: constexpr if statements are a C++17 feature
            detected during:
              instantiation of "void dequantize_kernel<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=__half, numBits=8, qType=quantize::Type::Symmetric, unroll=8, threads=512]"
  csrc/quantization/dequantize.cu(45): here
              instantiation of "void launch_dequantize_kernel(T *, const int8_t *, const float *, quantize::Type, int, int, int, cudaStream_t) [with T=__half]"
  csrc/quantization/dequantize.cu(55): here

  csrc/includes/quantization_utils.h(124): error: no suitable conversion function from "__half" to "float" exists
            detected during:
              instantiation of "T quantize::Params<quantize::Type::Asymmetric, numBits>::dequantize<T>(int8_t) [with numBits=8, T=float]"
  csrc/includes/dequantization_utils.h(75): here
              instantiation of "void dequantize::chunk(T *, const int8_t *, dequantize::Params<qType, numBits>) [with T=float, numBits=8, qType=quantize::Type::Asymmetric]"
  csrc/includes/dequantization_utils.h(151): here
              instantiation of "void dequantize::_to_global<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=8, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
  csrc/includes/dequantization_utils.h(167): here
              instantiation of "void dequantize::to_global<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=8, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
  csrc/quantization/dequantize.cu(18): here
              instantiation of "void dequantize_kernel<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=8, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
  csrc/quantization/dequantize.cu(47): here
              instantiation of "void launch_dequantize_kernel(T *, const int8_t *, const float *, quantize::Type, int, int, int, cudaStream_t) [with T=float]"
  csrc/quantization/dequantize.cu(64): here

  csrc/includes/quantization_utils.h(124): error: no suitable conversion function from "__half" to "float" exists
            detected during:
              instantiation of "T quantize::Params<quantize::Type::Asymmetric, numBits>::dequantize<T>(int8_t) [with numBits=4, T=float]"
  csrc/includes/dequantization_utils.h(78): here
              instantiation of "void dequantize::chunk(T *, const int8_t *, dequantize::Params<qType, numBits>) [with T=float, numBits=4, qType=quantize::Type::Asymmetric]"
  csrc/includes/dequantization_utils.h(151): here
              instantiation of "void dequantize::_to_global<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=4, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
  csrc/includes/dequantization_utils.h(167): here
              instantiation of "void dequantize::to_global<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=4, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
  csrc/quantization/dequantize.cu(18): here
              instantiation of "void dequantize_kernel<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=4, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
  csrc/quantization/dequantize.cu(51): here
              instantiation of "void launch_dequantize_kernel(T *, const int8_t *, const float *, quantize::Type, int, int, int, cudaStream_t) [with T=float]"
  csrc/quantization/dequantize.cu(64): here

  4 errors detected in the compilation of "csrc/quantization/dequantize.cu".
  error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1
  error: subprocess-exited-with-error
  
  × Running setup.py install for deepspeed did not run successfully.
  │ exit code: 1
  ╰─> See above for output

See also the build script below:

# Deepspeed ops cannot be built without CUDA
if [[ ${cuda_compiler_version} != "None" ]]; then
  export DS_BUILD_OPS=1

  # Set the CUDA arch list from
  # https://github.com/conda-forge/pytorch-cpu-feedstock/blob/2be0b38024b3b5601fcefce40596fc2a5fce4ab7/recipe/build_pytorch.sh#L94

  if [[ ${cuda_compiler_version} == 10.* ]]; then
    export TORCH_CUDA_ARCH_LIST="6.0;6.1;7.0;7.5+PTX"
  elif [[ ${cuda_compiler_version} == 11.0* ]]; then
    export TORCH_CUDA_ARCH_LIST="6.0;6.1;7.0;7.5;8.0+PTX"
  elif [[ ${cuda_compiler_version} == 11.1 ]]; then
    export TORCH_CUDA_ARCH_LIST="6.0;6.1;7.0;7.5;8.0;8.6+PTX"
  elif [[ ${cuda_compiler_version} == 11.2 ]]; then
    export TORCH_CUDA_ARCH_LIST="6.0;6.1;7.0;7.5;8.0;8.6+PTX"
  else
    echo "Unsupported cuda version. edit build.sh"
    exit 1
  fi

else
  export DS_BUILD_OPS=0
fi

# Disable sparse_attn since it requires an exact version of triton==1.0.0
export DS_BUILD_SPARSE_ATTN=0

python -m pip install . -vv

The conda builds worked fine for 0.8.0, so is there a specific change in 0.8.1 that could explain this error? Also, is CUDA 11.2 officially supported? I could not find that information in this repo.

Feb 19 '23 23:02 hadim

I am also having issues on Ubuntu 20.04: compiling version 0.8.1 in my Docker image fails as well. This works:

DS_BUILD_OPS=1 pip install git+https://github.com/microsoft/[email protected]

This does not:

DS_BUILD_OPS=1 pip install git+https://github.com/microsoft/[email protected]

My Docker image uses CUDA 11.7.

Feb 23 '23 19:02 mallorbc

Same issue on Ubuntu 20.04 with CUDA 11.7: 0.8.0 compiles, but 0.8.1 raises the same error.

Feb 24 '23 14:02 edwardzjl

Ok, so it does not seem to be conda-forge specific.
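
To check whether another build log shows the same failure as the reports above, one option is to look for two things together: the `-D__CUDA_NO_HALF_OPERATORS__` define on the nvcc command line and the rejected `+` on `__half` operands. That define disables the overloaded arithmetic operators for `__half` in CUDA's half-precision header, which is consistent with the `no operator "+"` errors. The following is only a minimal sketch, assuming the log has been saved to a file; the function name `same_failure` is hypothetical and not part of DeepSpeed or conda-forge.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DeepSpeed): checks a saved build log for
# the failure signature reported in this thread.
# Prints "match" and returns 0 only if both of these are present:
#   1. nvcc was invoked with -D__CUDA_NO_HALF_OPERATORS__
#   2. the compiler rejected "+" on __half operands
same_failure() {
  local log="$1"
  # -F treats the pattern as a fixed string, so "+" is matched literally.
  if grep -qF -- '-D__CUDA_NO_HALF_OPERATORS__' "$log" &&
     grep -qF 'operand types are: const __half + const __half' "$log"; then
    echo "match"
    return 0
  fi
  echo "no match"
  return 1
}
```

Usage would be something like `same_failure build.log` after redirecting the output of `pip install . -vv` to `build.log`.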

Feb 24 '23 15:02 hadim

Just here to say I have the same issue compiling 0.8.1 on my repository: https://github.com/P2Enjoy/kohya_ss-docker

Feb 27 '23 17:02 martinobettucci

I also ran into the same issue compiling 0.8.1 with CUDA 11.7. It does not seem to occur when compiling version 0.8.0.

10 errors detected in the compilation of "csrc/quantization/dequantize.cu".
error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1

Mar 07 '23 11:03 xfu83

Same for me with 0.8.2 using CUDA 11.3:

building 'deepspeed.ops.quantizer.quantizer_op' extension
creating build/temp.linux-x86_64-3.9/csrc/quantization
/g/data/z00/yxs900/installation/cuda/11.3.0/bin/nvcc -Icsrc/includes -I/g/data/z00/yxs900/installation/pytorch/v1.12.1/lib/python3.9/site-packages/torch/include -I/g/data/z00/yxs900/installation/pytorch/v1.12.1/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/g/data/z00/yxs900/installation/pytorch/v1.12.1/lib/python3.9/site-packages/torch/include/TH -I/g/data/z00/yxs900/installation/pytorch/v1.12.1/lib/python3.9/site-packages/torch/include/THC -I/apps/python3/3.9.2/include/python3.9 -c csrc/quantization/dequantize.cu -o build/temp.linux-x86_64-3.9/csrc/quantization/dequantize.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1013" -DTORCH_EXTENSION_NAME=quantizer_op -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80 -std=c++14
csrc/includes/reduction_utils.h(170): error: no operator "+" matches these operands
            operand types are: const __half + const __half

csrc/includes/reduction_utils.h(199): error: no operator "+" matches these operands
            operand types are: const __half2 + const __half2

csrc/includes/dequantization_utils.h(165): warning: constexpr if statements are a C++17 feature
          detected during:
            instantiation of "void dequantize_kernel<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=__half, numBits=8, qType=quantize::Type::Symmetric, unroll=8, threads=512]"
csrc/quantization/dequantize.cu(45): here
            instantiation of "void launch_dequantize_kernel(T *, const int8_t *, const float *, quantize::Type, int, int, int, cudaStream_t) [with T=__half]"
csrc/quantization/dequantize.cu(55): here

csrc/includes/quantization_utils.h(124): error: no suitable conversion function from "__half" to "float" exists
          detected during:
            instantiation of "T quantize::Params<quantize::Type::Asymmetric, numBits>::dequantize<T>(int8_t) [with numBits=8, T=float]"
csrc/includes/dequantization_utils.h(75): here
            instantiation of "void dequantize::chunk(T *, const int8_t *, dequantize::Params<qType, numBits>) [with T=float, numBits=8, qType=quantize::Type::Asymmetric]"
csrc/includes/dequantization_utils.h(151): here
            instantiation of "void dequantize::_to_global<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=8, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
csrc/includes/dequantization_utils.h(167): here
            instantiation of "void dequantize::to_global<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=8, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
csrc/quantization/dequantize.cu(18): here
            instantiation of "void dequantize_kernel<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=8, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
csrc/quantization/dequantize.cu(47): here
            instantiation of "void launch_dequantize_kernel(T *, const int8_t *, const float *, quantize::Type, int, int, int, cudaStream_t) [with T=float]"
csrc/quantization/dequantize.cu(64): here

csrc/includes/quantization_utils.h(124): error: no suitable conversion function from "__half" to "float" exists
          detected during:
            instantiation of "T quantize::Params<quantize::Type::Asymmetric, numBits>::dequantize<T>(int8_t) [with numBits=4, T=float]"
csrc/includes/dequantization_utils.h(78): here
            instantiation of "void dequantize::chunk(T *, const int8_t *, dequantize::Params<qType, numBits>) [with T=float, numBits=4, qType=quantize::Type::Asymmetric]"
csrc/includes/dequantization_utils.h(151): here
            instantiation of "void dequantize::_to_global<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=4, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
csrc/includes/dequantization_utils.h(167): here
            instantiation of "void dequantize::to_global<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=4, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
csrc/quantization/dequantize.cu(18): here
            instantiation of "void dequantize_kernel<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=4, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
csrc/quantization/dequantize.cu(51): here
            instantiation of "void launch_dequantize_kernel(T *, const int8_t *, const float *, quantize::Type, int, int, int, cudaStream_t) [with T=float]"
csrc/quantization/dequantize.cu(64): here

4 errors detected in the compilation of "csrc/quantization/dequantize.cu".
error: command '/g/data/z00/yxs900/installation/cuda/11.3.0/bin/nvcc' failed with exit code 1

Mar 15 '23 01:03 einzigsue

I hit the same issue when compiling version 0.8.3 and, after debugging, found the cause. I built DeepSpeed on a machine without a GPU, so the nvcc compilation arguments are ignored, but the quantizer needs those arguments.

See:
https://github.com/microsoft/DeepSpeed/blob/94f7da26b632f53b0a29ac54854cb8899e0d5b9e/op_builder/builder.py#L40-L41
https://github.com/microsoft/DeepSpeed/blob/94f7da26b632f53b0a29ac54854cb8899e0d5b9e/op_builder/builder.py#L91-L93
https://github.com/microsoft/DeepSpeed/blob/94f7da26b632f53b0a29ac54854cb8899e0d5b9e/op_builder/builder.py#L623-L631

After I commented this out, the compilation finished: https://github.com/microsoft/DeepSpeed/blob/94f7da26b632f53b0a29ac54854cb8899e0d5b9e/op_builder/builder.py#L40-L41
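
A quick way to check whether a build host may be in this GPU-less situation is to ask PyTorch whether it can see a CUDA device before running the install. This is only a sketch: it assumes that what matters is whether `torch.cuda.is_available()` returns true on the build machine, which is an assumption and not something verified against the linked builder.py lines. The function name `torch_sees_gpu` and the `PYTHON` override are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DeepSpeed): returns 0 if PyTorch reports a
# usable CUDA device on this host, non-zero otherwise (including when torch
# or the interpreter is missing).
# PYTHON may be set to point at a specific interpreter; defaults to "python".
torch_sees_gpu() {
  "${PYTHON:-python}" -c 'import sys, torch; sys.exit(0 if torch.cuda.is_available() else 1)' 2>/dev/null
}

if torch_sees_gpu; then
  echo "GPU visible to PyTorch"
else
  echo "no GPU visible to PyTorch; this may be the GPU-less build case"
fi
```

If the second message is printed on the build machine (for example inside a CI container or a `docker build` step), the build is running under the same conditions as mine.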

Mar 22 '23 11:03 jinzhen-lin
