DeepSpeed
[BUG] When training the ChatGLM model with DeepSpeed, I encountered an error compiling cpu_adam.
```
Using /home/zhangyu/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/zhangyu/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /home/zhangyu/miniconda3/envs/py310/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/home/zhangyu/miniconda3/envs/py310/include -isystem /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/include -isystem /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/include/TH -isystem /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/include/THC -isystem /home/zhangyu/miniconda3/envs/py310/include -isystem /home/zhangyu/miniconda3/envs/py310/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -DBF16_AVAILABLE -c /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
FAILED: custom_cuda_kernel.cuda.o
/home/zhangyu/miniconda3/envs/py310/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/home/zhangyu/miniconda3/envs/py310/include -isystem /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/include -isystem /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/include/TH -isystem /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/include/THC -isystem /home/zhangyu/miniconda3/envs/py310/include -isystem /home/zhangyu/miniconda3/envs/py310/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -DBF16_AVAILABLE -c /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
nvcc fatal : Unsupported gpu architecture 'compute_89'
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/home/zhangyu/miniconda3/envs/py310/include -isystem /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/include -isystem /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/include/TH -isystem /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/include/THC -isystem /home/zhangyu/miniconda3/envs/py310/include -isystem /home/zhangyu/miniconda3/envs/py310/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/home/zhangyu/miniconda3/envs/py310/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
ninja: build stopped: subcommand failed.

Traceback (most recent call last):
  /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1900 in _run_ninja_build
    1897         # To work around this, we pass in the fileno directly and hope that
    1898         # it is valid.
    1899         stdout_fileno = 1
  > 1900         subprocess.run(
    1901             command,
    1902             stdout=stdout_fileno if verbose else subprocess.PIPE,
    1903             stderr=subprocess.STDOUT,

  /home/zhangyu/miniconda3/envs/py310/lib/python3.10/subprocess.py:526 in run
     523             raise
     524         retcode = process.poll()
     525         if check and retcode:
  >  526             raise CalledProcessError(retcode, process.args,
     527                                      output=stdout, stderr=stderr)
     528     return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  /home/zhangyu/project/ChatGLM-Finetuning/finetuning_freeze.py:141 in
```
It seems that the issue is caused by the error message `nvcc fatal : Unsupported gpu architecture 'compute_89'`, but I'm not sure how to solve it.

It looks like you have an NVIDIA 4090 graphics card.
Running with CUDA 12.0 and torch 2.0.1 solves this problem.
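If you want to confirm the mismatch before upgrading, a quick check is to compare the GPU's compute capability with the nvcc that PyTorch's extension builder will pick up for the JIT build. A minimal sketch (assuming a CUDA toolkit is installed; `CUDA_HOME` can be `None` otherwise):

```python
import subprocess

import torch
from torch.utils.cpp_extension import CUDA_HOME

# Compute capability of the local GPU; an RTX 4090 reports (8, 9),
# which is where the 'compute_89' / 'sm_89' gencode flags in the log come from.
print("GPU capability:", torch.cuda.get_device_capability(0))

# CUDA version this torch build was compiled against (cu117 in the log above).
print("torch CUDA version:", torch.version.cuda)

# nvcc that torch.utils.cpp_extension will use to JIT-build ops like cpu_adam.
if CUDA_HOME is not None:
    out = subprocess.run([f"{CUDA_HOME}/bin/nvcc", "--version"],
                         capture_output=True, text=True)
    print(out.stdout)
```

The 4090's Ada architecture reports capability (8, 9); nvcc from CUDA 11.7 does not know `compute_89`, while CUDA 11.8 and later do, which is why moving to CUDA 12.0 resolves the build.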
@starphantom666, does the solution from @panyuyang work for you as well?
@tjruwase It doesn't work for me.
@avivbrokman, can you try adding `torch_adam: true` to the optimizer section of your ds_config? As described here, this will enable `torch.optim.Adam` instead of DeepSpeed's `cpu_adam` and should avoid the compilation error that you are seeing. `torch.optim.Adam` works fine with CPU offloading.
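For reference, a minimal sketch of where that flag sits, written as the equivalent Python dict rather than the JSON file. The learning rate and ZeRO/offload values are placeholders, and the exact nesting assumes `torch_adam` is read from the optimizer's `params` block:

```python
# Hypothetical ds_config fragment (placeholder values, not a tuned configuration).
ds_config = {
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 1e-5,           # placeholder learning rate
            "torch_adam": True,   # use torch.optim.Adam instead of DeepSpeed's cpu_adam
        },
    },
    # ZeRO settings live in their own section; torch_adam does not go here.
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
}
```

With this set, DeepSpeed should instantiate `torch.optim.Adam` at startup rather than JIT-compiling the `cpu_adam` extension, so the failing nvcc step above is never triggered.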
@tjruwase It worked! But if torch_adam works, why bother writing cpu_adam in the first place?
@avivbrokman, glad to hear that it worked.
We wrote cpu_adam to get a ~7x speedup over torch_adam. Although torch_adam has improved, cpu_adam was still ~3x faster the last time I checked. So it could still be worthwhile to figure out how to enable cpu_adam in your environment. But for now, at least you are unblocked.
Closing this issue since we don't have 4090 hardware to repro, and a workaround is available. Please re-open if appropriate.
> @avivbrokman, can you try adding `torch_adam: true` to the optimizer section of your ds_config? As described here, this will enable `torch.optim.Adam` instead of DeepSpeed's `cpu_adam` and should avoid the compilation error that you are seeing. `torch.optim.Adam` works fine with CPU offloading.
Hi @tjruwase, about "the optimizer section" you mentioned above, do you mean the section named `zero_optimization`?