
[BUG] When training the ChatGLM model with DeepSpeed, I encountered an error compiling cpu_adam.


```
Using /home/zhangyu/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/zhangyu/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /home/zhangyu/miniconda3/envs/py310/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/home/zhangyu/miniconda3/envs/py310/include -isystem /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/include -isystem /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/include/TH -isystem /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/include/THC -isystem /home/zhangyu/miniconda3/envs/py310/include -isystem /home/zhangyu/miniconda3/envs/py310/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -DBF16_AVAILABLE -c /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
FAILED: custom_cuda_kernel.cuda.o
nvcc fatal : Unsupported gpu architecture 'compute_89'
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/home/zhangyu/miniconda3/envs/py310/include -isystem /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/include -isystem /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/include/TH -isystem /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/include/THC -isystem /home/zhangyu/miniconda3/envs/py310/include -isystem /home/zhangyu/miniconda3/envs/py310/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/home/zhangyu/miniconda3/envs/py310/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
ninja: build stopped: subcommand failed.

Traceback (most recent call last):
  File "/home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
    subprocess.run(
  File "/home/zhangyu/miniconda3/envs/py310/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/zhangyu/project/ChatGLM-Finetuning/finetuning_freeze.py", line 141, in <module>
    main()
  File "/home/zhangyu/project/ChatGLM-Finetuning/finetuning_freeze.py", line 113, in main
    model_engine, optimizer, _, _ = deepspeed.initialize(config=conf,
  File "/home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1218, in _configure_basic_optimizer
    optimizer = DeepSpeedCPUAdam(model_parameters,
  File "/home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
    self.ds_opt_adam = CPUAdamBuilder().load()
  File "/home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 453, in load
    return self.jit_load(verbose)
  File "/home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 496, in jit_load
    op_module = load(name=self.name,
  File "/home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f4465a9a290>
Traceback (most recent call last):
  File "/home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
```

The failure seems to come from the message `nvcc fatal : Unsupported gpu architecture 'compute_89'`, but I'm not sure how to solve it. (The `compute_89` target corresponds to an Ada-generation GPU such as the NVIDIA 4090.)

starphantom666 avatar May 08 '23 08:05 starphantom666

Running with CUDA 12.0 and torch 2.0.1 solved this problem for me. (nvcc only gained support for the `compute_89` architecture in CUDA 11.8, so the CUDA 11.7 toolchain shown in the log above cannot target it.)
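
If it's unclear which toolkit the JIT build is picking up, here is a quick check (a sketch; `CUDA_HOME` is what torch.utils.cpp_extension uses to locate nvcc):

```python
import torch
from torch.utils.cpp_extension import CUDA_HOME

# The CUDA version the torch wheel was built against and the toolkit used to
# JIT-compile extensions like cpu_adam can differ; the build uses CUDA_HOME's nvcc.
print(torch.version.cuda)  # e.g. '11.7' in the failing py310_cu117 environment
print(CUDA_HOME)           # bin/nvcc under this path must support compute_89 (CUDA >= 11.8)
```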

panyuyang avatar May 13 '23 07:05 panyuyang

@starphantom666, does the solution from @panyuyang work for you as well?

tjruwase avatar May 15 '23 16:05 tjruwase

@tjruwase It doesn't work for me.

avivbrokman avatar Aug 02 '23 23:08 avivbrokman

@avivbrokman, can you try adding `"torch_adam": true` to the optimizer section of your ds_config? As described here, this enables `torch.optim.Adam` instead of DeepSpeed's cpu_adam, and should avoid the compilation error that you are seeing. `torch.optim.Adam` works fine for CPU offloading.
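
For illustration, a minimal sketch of where the flag lives (all other values are placeholders, not taken from this thread): `torch_adam` belongs under `optimizer.params`, while CPU offloading is configured separately under `zero_optimization`.

```python
# Hypothetical ds_config sketch passed to deepspeed.initialize(config=conf, ...).
# Only "torch_adam" is the point here; the remaining values are placeholders.
conf = {
    "train_batch_size": 8,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 1e-5,
            "torch_adam": True,  # use torch.optim.Adam instead of the JIT-compiled cpu_adam
        },
    },
    "zero_optimization": {  # offloading is configured here, not next to torch_adam
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
}
```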

tjruwase avatar Aug 03 '23 01:08 tjruwase

@tjruwase It worked! But if it works, why bother ever coding cpu_adam in the first place?

avivbrokman avatar Aug 03 '23 16:08 avivbrokman

@avivbrokman, glad to hear that it worked.

We wrote cpu_adam to get ~7X speedup over torch_adam. Although torch_adam has improved, cpu_adam was still ~3X faster last time I checked. So it could still be worthwhile to figure out how to enable cpu_adam in your environment, but for now, at least you are unblocked.
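
For anyone who does want cpu_adam without upgrading the CUDA toolkit, one possible workaround (a sketch, assuming the JIT build honors `TORCH_CUDA_ARCH_LIST`, which both torch.utils.cpp_extension and DeepSpeed's op builder read) is to pin the architecture list so the older nvcc is never asked for `compute_89`:

```python
import os

# Sketch of a workaround, not an official fix: restrict the -gencode targets
# before DeepSpeed JIT-compiles its ops. "8.6+PTX" makes nvcc emit sm_86 code
# plus PTX, which the driver can JIT-compile for an 8.9 (RTX 4090) GPU at runtime.
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.6+PTX"

import deepspeed  # import after setting the env var, before any op is built
```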

tjruwase avatar Aug 03 '23 16:08 tjruwase

Closing this issue since we don't have 4090 hardware to repro, and a workaround is available. Please re-open if appropriate.

tjruwase avatar Aug 10 '23 10:08 tjruwase

> @avivbrokman, can you try adding `"torch_adam": true` to the optimizer section of your ds_config? As described here, this enables `torch.optim.Adam` instead of DeepSpeed's cpu_adam, and should avoid the compilation error that you are seeing. `torch.optim.Adam` works fine for CPU offloading.

Hi @tjruwase, about "the optimizer section" you mentioned above: do you mean the section named `zero_optimization`?

zydmtaichi avatar Jul 31 '24 00:07 zydmtaichi