ChatGLM-Efficient-Tuning
Multi-GPU accelerate fails!!! I used the script from the project README, and it does not run at all.
File "D:\ai_zeng\ChatGLM-Efficient-Tuning\src\train_sft.py", line 28, in main
model_args, data_args, training_args, finetuning_args = prepare_args(stage="sft")
File "D:\ai_zeng\ChatGLM-Efficient-Tuning\src\utils\common.py", line 295, in prepare_args
model_args, data_args, training_args, finetuning_args = parser.parse_args_into_dataclasses()
File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\transformers\hf_argparser.py", line 346, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "<string>", line 116, in __init__
File "
File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\transformers\training_args.py", line 1764, in device
return self._setup_devices
File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\transformers\utils\generic.py", line 54, in __get__
cached = self.fget(obj)
File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\transformers\training_args.py", line 1695, in _setup_devices
self.distributed_state = PartialState(backend=self.ddp_backend)
File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\accelerate\state.py", line 197, in __init__
torch.cuda.set_device(self.device)
File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\torch\cuda\__init__.py", line 350, in set_device
torch._C._cuda_setDevice(device)
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 26208) of binary: D:\ProgramData\anaconda3\envs\py10\python.exe
Traceback (most recent call last):
File "D:\ProgramData\anaconda3\envs\py10\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\ProgramData\anaconda3\envs\py10\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\ProgramData\anaconda3\envs\py10\Scripts\accelerate.exe\__main__.py", line 7, in <module>
File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
args.func(args)
File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\accelerate\commands\launch.py", line 928, in launch_command
multi_gpu_launcher(args)
File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\accelerate\commands\launch.py", line 627, in multi_gpu_launcher
distrib_run.run(args)
File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\torch\distributed\run.py", line 785, in run
elastic_launch(
File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
src/train_sft.py FAILED
Failures:
[1]:
time : 2023-07-03_23:45:01
host : XTZJ-20210729YS
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 26228)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2023-07-03_23:45:01
host : XTZJ-20210729YS
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 26208)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
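One thing worth ruling out for the AttributeError in the log above: a missing torch._C._cuda_setDevice usually points to a PyTorch build compiled without CUDA support, although that is not confirmed for this environment. A minimal sketch for checking the build, to be run inside the same conda env. The helper name describe_torch_build is hypothetical and not part of the project.

```python
import importlib.util
import platform


def describe_torch_build():
    """Collect basic facts about the interpreter and the installed PyTorch build, if any."""
    info = {
        "python_bits": platform.architecture()[0],  # e.g. "64bit"
        "torch_installed": False,
    }
    # Avoid an ImportError when torch is absent from this environment.
    if importlib.util.find_spec("torch") is None:
        return info

    import torch

    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    info["built_with_cuda"] = torch.version.cuda  # None on CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    info["gpu_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```

If built_with_cuda prints None, the installed package is a CPU-only build, and multi-GPU training cannot work until a CUDA build of PyTorch is installed in that env.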
(chatglm_etuning) PS D:\ai_zeng\ChatGLM-Efficient-Tuning> accelerate launch src/train_sft.py --config_file accelerate_config.yaml --do_train --dataset book_train_3 --finetuning_type lora --output_dir path_to_sft_checkpoint_5e_5 --per_device_train_batch_size 8 --gradient_accumulation_steps 2 --lr_scheduler_type cosine --logging_steps 10 --save_steps 1000 --learning_rate 5e-5 --num_train_epochs 100.0 --fp16 --ddp_find_unused_parameters False
NOTE: Redirects are currently not supported in Windows or MacOs.
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [XTZJ-20210729YS]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
File "D:\ai_zeng\ChatGLM-Efficient-Tuning\src\train_sft.py", line 6, in <module>
File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\peft\__init__.py", line 22, in <module>
File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\peft\peft_model.py", line 31, in <module>
OSError: [WinError 193] %1 is not a valid Win32 application.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2104) of binary: D:\ProgramData\anaconda3\envs\chatglm_etuning\python.exe
Traceback (most recent call last):
File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\ProgramData\anaconda3\envs\chatglm_etuning\Scripts\accelerate.exe\__main__.py", line 7, in <module>
File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
args.func(args)
File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\accelerate\commands\launch.py", line 928, in launch_command
multi_gpu_launcher(args)
File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\accelerate\commands\launch.py", line 627, in multi_gpu_launcher
distrib_run.run(args)
File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\torch\distributed\run.py", line 785, in run
elastic_launch(
File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
src/train_sft.py FAILED
Failures:
[1]:
time : 2023-07-03_23:54:24
host : XTZJ-20210729YS
exitcode : 1 (pid: 4312)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2023-07-03_23:54:24
host : XTZJ-20210729YS
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2104)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
accelerate launch src/train_sft.py --config_file accelerate_config.yaml --do_train --dataset book_train_3 --finetuning_type lora --output_dir path_to_sft_checkpoint_5e_5 --per_device_train_batch_size 8 --gradient_accumulation_steps 2 --lr_scheduler_type cosine --logging_steps 10 --save_steps 1000 --learning_rate 5e-5 --num_train_epochs 100.0 --fp16 --ddp_find_unused_parameters False
The command above is the script from the project README. It does not run at all.
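The second run fails earlier than the first, with OSError: [WinError 193] raised while peft is being imported. To narrow down which package triggers it, the imports can be tried one by one in a single process, outside accelerate launch. A rough sketch; the helper try_imports is hypothetical, and the package list is simply taken from the tracebacks in this post.

```python
import importlib


def try_imports(names):
    """Import each module by name; map name -> None on success, or an error string."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # OSError (e.g. WinError 193) and ImportError both land here
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    packages = ["torch", "accelerate", "transformers", "peft"]
    for name, error in try_imports(packages).items():
        print(f"{name}: {error or 'ok'}")
```

WinError 193 generally means Windows was asked to load a file that is not a valid binary for this platform, so the first package that fails here is a reasonable candidate for a clean reinstall in the chatglm_etuning env.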