DeepSpeed
Compilation error for 0.8.1 with CUDA 11.2
The conda-forge bot is building the new deepspeed packages for 0.8.1. See https://github.com/conda-forge/deepspeed-feedstock/pull/6#issuecomment-1436120690 for context.
All the builds for CUDA 11.2 are failing because of the below error:
/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_build_env/bin/x86_64-conda-linux-gnu-c++ -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib -Wl,-rpath-link,/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib -L/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib -Wl,-rpath-link,/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib -L/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined 
-Wl,-rpath,/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib -Wl,-rpath-link,/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib -L/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/work=/usr/local/src/conda/deepspeed-0.8.1 -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p=/usr/local/src/conda-prefix -isystem /usr/local/cuda/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include -isystem 
/usr/local/cuda/include build/temp.linux-x86_64-cpython-311/csrc/lamb/fused_lamb_cuda.o build/temp.linux-x86_64-cpython-311/csrc/lamb/fused_lamb_cuda_kernel.o -L/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.11/site-packages/torch/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-cpython-311/deepspeed/ops/lamb/fused_lamb_op.cpython-311-x86_64-linux-gnu.so
building 'deepspeed.ops.quantizer.quantizer_op' extension
creating build/temp.linux-x86_64-cpython-311/csrc/quantization
/usr/local/cuda/bin/nvcc -Icsrc/includes -I/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.11/site-packages/torch/include -I/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.11/site-packages/torch/include/TH -I/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.11/site-packages/torch/include/THC -I/home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include/python3.11 -c csrc/quantization/dequantize.cu -o build/temp.linux-x86_64-cpython-311/csrc/quantization/dequantize.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1014\" 
-DTORCH_EXTENSION_NAME=quantizer_op -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -ccbin /home/conda/feedstock_root/build_artifacts/deepspeed_1676841729762/_build_env/bin/x86_64-conda-linux-gnu-cc -std=c++14
csrc/includes/reduction_utils.h(170): error: no operator "+" matches these operands
operand types are: const __half + const __half
csrc/includes/reduction_utils.h(199): error: no operator "+" matches these operands
operand types are: const __half2 + const __half2
csrc/includes/dequantization_utils.h(165): warning: constexpr if statements are a C++17 feature
detected during:
instantiation of "void dequantize_kernel<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=__half, numBits=8, qType=quantize::Type::Symmetric, unroll=8, threads=512]"
csrc/quantization/dequantize.cu(45): here
instantiation of "void launch_dequantize_kernel(T *, const int8_t *, const float *, quantize::Type, int, int, int, cudaStream_t) [with T=__half]"
csrc/quantization/dequantize.cu(55): here
csrc/includes/quantization_utils.h(124): error: no suitable conversion function from "__half" to "float" exists
detected during:
instantiation of "T quantize::Params<quantize::Type::Asymmetric, numBits>::dequantize<T>(int8_t) [with numBits=8, T=float]"
csrc/includes/dequantization_utils.h(75): here
instantiation of "void dequantize::chunk(T *, const int8_t *, dequantize::Params<qType, numBits>) [with T=float, numBits=8, qType=quantize::Type::Asymmetric]"
csrc/includes/dequantization_utils.h(151): here
instantiation of "void dequantize::_to_global<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=8, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
csrc/includes/dequantization_utils.h(167): here
instantiation of "void dequantize::to_global<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=8, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
csrc/quantization/dequantize.cu(18): here
instantiation of "void dequantize_kernel<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=8, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
csrc/quantization/dequantize.cu(47): here
instantiation of "void launch_dequantize_kernel(T *, const int8_t *, const float *, quantize::Type, int, int, int, cudaStream_t) [with T=float]"
csrc/quantization/dequantize.cu(64): here
csrc/includes/quantization_utils.h(124): error: no suitable conversion function from "__half" to "float" exists
detected during:
instantiation of "T quantize::Params<quantize::Type::Asymmetric, numBits>::dequantize<T>(int8_t) [with numBits=4, T=float]"
csrc/includes/dequantization_utils.h(78): here
instantiation of "void dequantize::chunk(T *, const int8_t *, dequantize::Params<qType, numBits>) [with T=float, numBits=4, qType=quantize::Type::Asymmetric]"
csrc/includes/dequantization_utils.h(151): here
instantiation of "void dequantize::_to_global<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=4, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
csrc/includes/dequantization_utils.h(167): here
instantiation of "void dequantize::to_global<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=4, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
csrc/quantization/dequantize.cu(18): here
instantiation of "void dequantize_kernel<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=4, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
csrc/quantization/dequantize.cu(51): here
instantiation of "void launch_dequantize_kernel(T *, const int8_t *, const float *, quantize::Type, int, int, int, cudaStream_t) [with T=float]"
csrc/quantization/dequantize.cu(64): here
4 errors detected in the compilation of "csrc/quantization/dequantize.cu".
error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1
error: subprocess-exited-with-error
× Running setup.py install for deepspeed did not run successfully.
│ exit code: 1
╰─> See above for output
See also the build script below:
# Deepspeed ops cannot be built without CUDA
if [[ ${cuda_compiler_version} != "None" ]]; then
    export DS_BUILD_OPS=1
    # Set the CUDA arch list from
    # https://github.com/conda-forge/pytorch-cpu-feedstock/blob/2be0b38024b3b5601fcefce40596fc2a5fce4ab7/recipe/build_pytorch.sh#L94
    if [[ ${cuda_compiler_version} == 10.* ]]; then
        export TORCH_CUDA_ARCH_LIST="6.0;6.1;7.0;7.5+PTX"
    elif [[ ${cuda_compiler_version} == 11.0* ]]; then
        export TORCH_CUDA_ARCH_LIST="6.0;6.1;7.0;7.5;8.0+PTX"
    elif [[ ${cuda_compiler_version} == 11.1 ]]; then
        export TORCH_CUDA_ARCH_LIST="6.0;6.1;7.0;7.5;8.0;8.6+PTX"
    elif [[ ${cuda_compiler_version} == 11.2 ]]; then
        export TORCH_CUDA_ARCH_LIST="6.0;6.1;7.0;7.5;8.0;8.6+PTX"
    else
        echo "Unsupported cuda version. edit build.sh"
        exit 1
    fi
else
    export DS_BUILD_OPS=0
fi
# Disable sparse_attn since it requires an exact version of triton==1.0.0
export DS_BUILD_SPARSE_ATTN=0
python -m pip install . -vv
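For reference, the `TORCH_CUDA_ARCH_LIST` value that the script sets for CUDA 11.2 appears to line up with the `-gencode` flags in the failing nvcc command from the log. The Python sketch below illustrates that correspondence. `arch_list_to_gencode` is a hypothetical helper written for illustration, not the code DeepSpeed or PyTorch actually uses, and the sorting step is an assumption chosen to reproduce the flag order seen in the log.

```python
def arch_list_to_gencode(arch_list):
    """Expand a TORCH_CUDA_ARCH_LIST-style string into -gencode flags.

    Hypothetical helper for illustration only (assumed behaviour):
    every entry yields an sm_XX target, and an entry ending in +PTX
    additionally yields a compute_XX (PTX) target.
    """
    flags = set()
    for entry in arch_list.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        wants_ptx = entry.endswith("+PTX")
        version = entry[:-len("+PTX")] if wants_ptx else entry
        num = version.replace(".", "")  # "8.6" -> "86"
        flags.add(f"-gencode=arch=compute_{num},code=sm_{num}")
        if wants_ptx:
            flags.add(f"-gencode=arch=compute_{num},code=compute_{num}")
    # Sorting is an assumption; it reproduces the order seen in the log.
    return sorted(flags)


# Arch list the build script exports for CUDA 11.2.
for flag in arch_list_to_gencode("6.0;6.1;7.0;7.5;8.0;8.6+PTX"):
    print(flag)
```

Under these assumptions the printed flags are the seven `-gencode` options that appear in the nvcc command for `dequantize.cu`, which suggests the arch list itself is being picked up correctly and is not what triggers the `__half` errors.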
The conda builds were working fine for 0.8.0, so I wonder whether a specific change in 0.8.1 could explain this error. Also, is CUDA 11.2 officially supported? I could not find that information in this repo.
I am also having issues on Ubuntu 20.04: compiling version 0.8.1 in my Docker image fails as well. This works:
DS_BUILD_OPS=1 pip install git+https://github.com/microsoft/[email protected]
This does not:
DS_BUILD_OPS=1 pip install git+https://github.com/microsoft/[email protected]
My Docker image uses CUDA 11.7.
Same issue here: Ubuntu 20.04, CUDA 11.7. Version 0.8.0 compiles, but 0.8.1 raises the same error.
Ok, so it does not seem to be conda-forge specific.
Just here to say I have the same issue compiling 0.8.1 on my repository: https://github.com/P2Enjoy/kohya_ss-docker
I also ran into the same issue compiling 0.8.1 with CUDA 11.7. It doesn't seem to occur when compiling version 0.8.0.
10 errors detected in the compilation of "csrc/quantization/dequantize.cu".
error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1
Same for me with 0.8.2 using CUDA 11.3:
building 'deepspeed.ops.quantizer.quantizer_op' extension
creating build/temp.linux-x86_64-3.9/csrc/quantization
/g/data/z00/yxs900/installation/cuda/11.3.0/bin/nvcc -Icsrc/includes -I/g/data/z00/yxs900/installation/pytorch/v1.12.1/lib/python3.9/site-packages/torch/include -I/g/data/z00/yxs900/installation/pytorch/v1.12.1/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/g/data/z00/yxs900/installation/pytorch/v1.12.1/lib/python3.9/site-packages/torch/include/TH -I/g/data/z00/yxs900/installation/pytorch/v1.12.1/lib/python3.9/site-packages/torch/include/THC -I/apps/python3/3.9.2/include/python3.9 -c csrc/quantization/dequantize.cu -o build/temp.linux-x86_64-3.9/csrc/quantization/dequantize.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1013" -DTORCH_EXTENSION_NAME=quantizer_op -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80 -std=c++14
csrc/includes/reduction_utils.h(170): error: no operator "+" matches these operands
operand types are: const __half + const __half
csrc/includes/reduction_utils.h(199): error: no operator "+" matches these operands
operand types are: const __half2 + const __half2
csrc/includes/dequantization_utils.h(165): warning: constexpr if statements are a C++17 feature
detected during:
instantiation of "void dequantize_kernel<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=__half, numBits=8, qType=quantize::Type::Symmetric, unroll=8, threads=512]"
csrc/quantization/dequantize.cu(45): here
instantiation of "void launch_dequantize_kernel(T *, const int8_t *, const float *, quantize::Type, int, int, int, cudaStream_t) [with T=__half]"
csrc/quantization/dequantize.cu(55): here
csrc/includes/quantization_utils.h(124): error: no suitable conversion function from "__half" to "float" exists
detected during:
instantiation of "T quantize::Params<quantize::Type::Asymmetric, numBits>::dequantize<T>(int8_t) [with numBits=8, T=float]"
csrc/includes/dequantization_utils.h(75): here
instantiation of "void dequantize::chunk(T *, const int8_t *, dequantize::Params<qType, numBits>) [with T=float, numBits=8, qType=quantize::Type::Asymmetric]"
csrc/includes/dequantization_utils.h(151): here
instantiation of "void dequantize::_to_global<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=8, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
csrc/includes/dequantization_utils.h(167): here
instantiation of "void dequantize::to_global<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=8, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
csrc/quantization/dequantize.cu(18): here
instantiation of "void dequantize_kernel<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=8, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
csrc/quantization/dequantize.cu(47): here
instantiation of "void launch_dequantize_kernel(T *, const int8_t *, const float *, quantize::Type, int, int, int, cudaStream_t) [with T=float]"
csrc/quantization/dequantize.cu(64): here
csrc/includes/quantization_utils.h(124): error: no suitable conversion function from "__half" to "float" exists
detected during:
instantiation of "T quantize::Params<quantize::Type::Asymmetric, numBits>::dequantize<T>(int8_t) [with numBits=4, T=float]"
csrc/includes/dequantization_utils.h(78): here
instantiation of "void dequantize::chunk(T *, const int8_t *, dequantize::Params<qType, numBits>) [with T=float, numBits=4, qType=quantize::Type::Asymmetric]"
csrc/includes/dequantization_utils.h(151): here
instantiation of "void dequantize::_to_global<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=4, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
csrc/includes/dequantization_utils.h(167): here
instantiation of "void dequantize::to_global<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=4, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
csrc/quantization/dequantize.cu(18): here
instantiation of "void dequantize_kernel<T,numBits,qType,unroll,threads>(T *, const int8_t *, const float *, int, int) [with T=float, numBits=4, qType=quantize::Type::Asymmetric, unroll=8, threads=512]"
csrc/quantization/dequantize.cu(51): here
instantiation of "void launch_dequantize_kernel(T *, const int8_t *, const float *, quantize::Type, int, int, int, cudaStream_t) [with T=float]"
csrc/quantization/dequantize.cu(64): here
4 errors detected in the compilation of "csrc/quantization/dequantize.cu".
error: command '/g/data/z00/yxs900/installation/cuda/11.3.0/bin/nvcc' failed with exit code 1
I hit the same issue when compiling version 0.8.3, and after debugging I found the cause.
I built DeepSpeed on a machine without a GPU, so the nvcc compilation arguments are ignored, but the quantizer needs those arguments.
See https://github.com/microsoft/DeepSpeed/blob/94f7da26b632f53b0a29ac54854cb8899e0d5b9e/op_builder/builder.py#L40-L41 https://github.com/microsoft/DeepSpeed/blob/94f7da26b632f53b0a29ac54854cb8899e0d5b9e/op_builder/builder.py#L91-L93 https://github.com/microsoft/DeepSpeed/blob/94f7da26b632f53b0a29ac54854cb8899e0d5b9e/op_builder/builder.py#L623-L631
After I commented this out, the compilation finished: https://github.com/microsoft/DeepSpeed/blob/94f7da26b632f53b0a29ac54854cb8899e0d5b9e/op_builder/builder.py#L40-L41
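One way to check whether a given build is affected is to look at the nvcc command in the log. Both failing commands in this thread define `__CUDA_NO_HALF_OPERATORS__`, `__CUDA_NO_HALF2_OPERATORS__` and `__CUDA_NO_HALF_CONVERSIONS__`. As far as I understand CUDA's `cuda_fp16.hpp`, these defines switch off the `__half`/`__half2` operator overloads and the implicit `__half` to `float` conversion, which would match the three kinds of error reported. The sketch below is a hypothetical log-inspection helper, not part of DeepSpeed; the mapping from macro to effect is my reading of the header and should be verified against your toolkit.

```python
import shlex

# Assumed effect of each guard macro (my reading of cuda_fp16.hpp;
# verify against the CUDA toolkit version you build with).
HALF_GUARDS = {
    "__CUDA_NO_HALF_OPERATORS__": "arithmetic operators on __half",
    "__CUDA_NO_HALF2_OPERATORS__": "arithmetic operators on __half2",
    "__CUDA_NO_HALF_CONVERSIONS__": "implicit __half <-> float conversions",
}


def find_half_guards(nvcc_command):
    """Return the fp16 guard macros passed via -D on an nvcc command line."""
    found = {}
    for token in shlex.split(nvcc_command):
        if token.startswith("-D"):
            name = token[2:].split("=", 1)[0]  # drop any "=value" part
            if name in HALF_GUARDS:
                found[name] = HALF_GUARDS[name]
    return found


# Shortened excerpt of the failing quantizer_op compile step from the logs.
command = (
    "nvcc -c csrc/quantization/dequantize.cu "
    "-D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ "
    "-D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ "
    "--expt-relaxed-constexpr -std=c++14"
)
for name, effect in find_half_guards(command).items():
    print(f"{name}: disables {effect}")
```

If the helper reports any of these macros for the `quantizer_op` step, the build is probably in the state described above, where the arguments the quantizer needs were not applied.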