
ChatGLM3 four-GPU training fails

Open eanfs opened this issue 1 year ago • 1 comments

[2024-02-04 17:56:47,007] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py311_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu116/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Traceback (most recent call last):
  File "/home/workspace/ChatGLM-Finetuning/train.py", line 234, in <module>
    main()
  File "/home/workspace/ChatGLM-Finetuning/train.py", line 178, in main
    model, optimizer, _, lr_scheduler = deepspeed.initialize(model=model, args=args, config=ds_config,
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/__init__.py", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1186, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1263, in _configure_basic_optimizer
    optimizer = FusedAdam(
                ^^^^^^^^^^
  File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 446, in load
    return self.jit_load(verbose)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 489, in jit_load
    op_module = load(name=self.name,
                ^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1534, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1936, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 573, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1233, in create_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py311_cu116/fused_adam/fused_adam.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv

[The other three ranks print the same "Loading extension module fused_adam..." line followed by an identical traceback; their interleaved copies are omitted here.]

[2024-02-04 17:56:50,782] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 30665
[2024-02-04 17:56:50,797] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 30666
[2024-02-04 17:56:50,807] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 30667
[2024-02-04 17:56:50,817] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 30668
[2024-02-04 17:56:50,818] [ERROR] [launch.py:321:sigkill_handler] ['/root/miniconda3/envs/chatglm/bin/python', '-u', 'train.py', '--local_rank=3', '--train_path', 'data/d2q_0.json', '--model_name_or_path', 'chatglm3-6b/', '--per_device_train_batch_size', '1', '--max_len', '1560', '--max_src_len', '1024', '--learning_rate', '1e-4', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '4', '--warmup_ratio', '0.1', '--mode', 'glm3', '--train_type', 'lora', '--freeze_module_name', 'layers.27.,layers.26.,layers.25.,layers.24.', '--seed', '1234', '--ds_file', 'ds_zero2_no_offload.json', '--gradient_checkpointing', '--show_loss_step', '10', '--output_dir', './output-glm3'] exits with return code = 1

eanfs, Feb 04 '24 09:02

Your environment is broken: the binaries are incompatible. Rebuild the system from scratch. _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv is a C++-related error.

zhouchanglin-rr, Mar 22 '24 07:03
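To see why the undefined symbol points at a C++ runtime problem, it helps to demangle it. The sketch below is not part of the thread: `demangle_simple` is a hypothetical helper that handles only the small subset of the Itanium C++ ABI mangling needed for this symbol (a nested name, optionally prefixed with `St` for `std`, ending in a parameterless function). For anything more general, the `c++filt` tool from binutils is the usual choice.

```python
import re


def demangle_simple(symbol: str) -> str:
    """Demangle a tiny subset of Itanium C++ ABI names (illustrative only).

    Supported form: _ZN [St] <len><identifier>... E v
    i.e. a nested name whose function takes no parameters.
    Anything else raises ValueError.
    """
    if not symbol.startswith("_ZN"):
        raise ValueError("not a nested Itanium-mangled name")

    pos = 3
    parts = []

    # 'St' is the ABI abbreviation for the std namespace.
    if symbol.startswith("St", pos):
        parts.append("std")
        pos += 2

    # Each component is a decimal length followed by that many characters.
    while pos < len(symbol) and symbol[pos] != "E":
        match = re.match(r"\d+", symbol[pos:])
        if match is None:
            raise ValueError(f"unsupported construct at offset {pos}")
        length = int(match.group())
        pos += len(match.group())
        if pos + length > len(symbol):
            raise ValueError("identifier length runs past end of symbol")
        parts.append(symbol[pos:pos + length])
        pos += length

    # 'E' closes the nested name; 'v' means an empty (void) parameter list.
    if symbol[pos:] != "Ev":
        raise ValueError("only parameterless functions are supported")

    return "::".join(parts) + "()"


if __name__ == "__main__":
    print(demangle_simple("_ZNSt15__exception_ptr13exception_ptr9_M_addrefEv"))
    # -> std::__exception_ptr::exception_ptr::_M_addref()
```

The demangled name, `std::__exception_ptr::exception_ptr::_M_addref()`, belongs to the GNU C++ standard library (libstdc++). An undefined-symbol error for it when `fused_adam.so` is imported therefore suggests, as the reply says, a binary incompatibility: one plausible cause is that the extension was built or linked against a different C++ runtime than the one the Python process loads. The thread does not confirm the exact cause.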