VisualGLM-6B

Fine-tuning fails with RuntimeError: Error building extension 'fused_adam'

Open dhhcj opened this issue 2 years ago • 11 comments

I ran fine-tuning with the default command and got the following error:

[2023-05-22 16:51:07,239] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py39_cu117/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include/THC -isystem /root/anaconda3/envs/torch20/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -std=c++17 -c /root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/usr/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include/THC -isystem /root/anaconda3/envs/torch20/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -std=c++17 -c /root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
nvcc fatal : Value 'c++17' is not defined for option 'std'
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/include/THC -isystem /root/anaconda3/envs/torch20/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/root/anaconda3/envs/torch20/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nfs_data/VisualGLM-6B-main/finetune_visualglm.py", line 188, in
    training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=create_dataset_function, collate_fn=data_collator)
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/sat/training/deepspeed_training.py", line 98, in training_main
    model, optimizer = setup_model_untrainable_params_and_optimizer(args, model)
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/sat/training/deepspeed_training.py", line 161, in setup_model_untrainable_params_and_optimizer
    model, optimizer, _, _ = deepspeed.initialize(
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/init.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 308, in init
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1224, in _configure_basic_optimizer
    optimizer = FusedAdam(
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/adam/fused_adam.py", line 71, in init
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
    return self.jit_load(verbose)
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/root/anaconda3/envs/torch20/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
VM-3-158-ubuntu:1785083:1800122 [0] NCCL INFO [Service thread] Connection closed by localRank 0
VM-3-158-ubuntu:1785083:1785083 [0] NCCL INFO comm 0x8abbc410 rank 0 nranks 1 cudaDev 0 busId 80 - Abort COMPLETE
VM-3-158-ubuntu:1785083:1800126 [0] NCCL INFO [Service thread] Connection closed by localRank 0
VM-3-158-ubuntu:1785083:1785083 [0] NCCL INFO comm 0x8abc35b0 rank 0 nranks 1 cudaDev 0 busId 80 - Abort COMPLETE
[2023-05-22 16:51:50,540] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 1785083
[2023-05-22 16:51:50,540] [ERROR] [launch.py:434:sigkill_handler] ['/root/anaconda3/envs/torch20/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=0', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '300', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '20', '--skip-init', '--fp16', '--use_lora'] exits with return code = 1

dhhcj avatar May 22 '23 08:05 dhhcj
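
One reading of the log above, offered as an assumption rather than a confirmed diagnosis: the step that fails is the `nvcc` call, and `nvcc fatal : Value 'c++17' is not defined for option 'std'` is the message an nvcc older than CUDA 11.0 prints when it is handed `-std=c++17`. The build picks up `/usr/bin/nvcc`, while the extension cache directory is named `py39_cu117`, so the system compiler may be older than the CUDA 11.7 that this PyTorch build targets. The Python sketch below checks that; the function names are hypothetical helpers, not part of DeepSpeed or PyTorch.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_nvcc_release(version_text: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def supports_cxx17(release: Tuple[int, int]) -> bool:
    """nvcc accepts -std=c++17 starting with CUDA 11.0."""
    return release >= (11, 0)


def check_nvcc(nvcc_path: Optional[str] = None) -> str:
    """Report whether the nvcc found on PATH (or at nvcc_path) can build with C++17."""
    nvcc = nvcc_path or shutil.which("nvcc")
    if nvcc is None:
        return "nvcc not found on PATH"
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    release = parse_nvcc_release(result.stdout)
    if release is None:
        return f"could not parse a release number from {nvcc}"
    verdict = "supports" if supports_cxx17(release) else "does NOT support"
    return f"{nvcc}: CUDA {release[0]}.{release[1]} {verdict} -std=c++17"


def torch_cuda_version() -> str:
    """Report the CUDA version PyTorch was built against, if torch is installed."""
    try:
        import torch
    except ImportError:
        return "torch is not importable in this environment"
    return f"PyTorch was built against CUDA {torch.version.cuda}"


if __name__ == "__main__":
    print(check_nvcc())
    print(torch_cuda_version())
```

If the reported release is below 11.0, a commonly suggested remedy is to point the `CUDA_HOME` environment variable at a CUDA 11.x toolkit before launching, since `torch.utils.cpp_extension` consults `CUDA_HOME` when locating nvcc. This thread does not confirm which fix worked for the original poster.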

I ran into the same problem.

ZeyuZhu0120 avatar May 22 '23 17:05 ZeyuZhu0120

I hit the same problem.

NewKeyTo avatar May 24 '23 05:05 NewKeyTo

Have you been able to run fine-tuning yet?

WangRongsheng avatar May 24 '23 09:05 WangRongsheng

Have you been able to run fine-tuning yet?

No, I'm still stuck at this point. I don't know whether it's a g++ problem or something else.

dhhcj avatar May 24 '23 10:05 dhhcj
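
On the question of whether g++ is to blame: in a ninja build log, the step that broke is the one followed by a `FAILED:` line, and the compiler's own diagnostic follows it. The sketch below pulls those lines out of a log so the failing tool is easy to spot. It is a hypothetical helper, not a DeepSpeed or PyTorch utility, and `SAMPLE_LOG` is an abbreviated excerpt of the log in the original post with the long compiler flags omitted.

```python
import re
from typing import List

# Abbreviated excerpt of the original post's log (long flag lists omitted).
SAMPLE_LOG = """\
[1/3] /usr/bin/nvcc -std=c++17 -c multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
nvcc fatal : Value 'c++17' is not defined for option 'std'
[2/3] c++ -std=c++17 -std=c++14 -c fused_adam_frontend.cpp -o fused_adam_frontend.o
ninja: build stopped: subcommand failed.
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
RuntimeError: Error building extension 'fused_adam'
"""


def find_build_failures(log_text: str) -> List[str]:
    """Return the lines that name a failed target or carry a compiler diagnostic."""
    hits = []
    for line in log_text.splitlines():
        stripped = line.strip()
        if (
            stripped.startswith("FAILED:")          # ninja marks the broken target
            or "nvcc fatal" in stripped             # nvcc's fatal diagnostics
            or re.search(r"\berror:", stripped)     # gcc/clang style diagnostics
        ):
            hits.append(stripped)
    return hits


if __name__ == "__main__":
    for hit in find_build_failures(SAMPLE_LOG):
        print(hit)
```

On this excerpt the helper reports the `multi_tensor_adam.cuda.o` target and the `nvcc fatal` line, and nothing from the `c++` step, which suggests the CUDA compiler rather than g++ is the part that rejected the build.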

Same problem. Have you solved it?

hh0525 avatar May 28 '23 02:05 hh0525

Have you solved this?

A-Kein avatar Jun 07 '23 14:06 A-Kein

Same question here.

lynquantumman avatar Aug 29 '23 08:08 lynquantumman

How did everyone solve this?

ShowiBin avatar Oct 07 '23 06:10 ShowiBin

Has this been solved?
VisualGLM-6B# bash finetune/finetune_visualglm.sh
NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
FAILED: multi_tensor_adam.cuda.o
ninja: build stopped: subcommand failed.
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Sat/deepspeed torch/utils/cpp_extension.py raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'

elesun2018 avatar Oct 17 '23 09:10 elesun2018

See here: https://github.com/THUDM/VisualGLM-6B/issues/125#issuecomment-1630407747

1049451037 avatar Oct 23 '23 07:10 1049451037

Has this been solved?

chenchen333-dev avatar Mar 15 '24 01:03 chenchen333-dev