
Multi-GPU `accelerate` launch fails!!! Using the script from the README — the script in the project README simply does not run

Open ArtificialZeng opened this issue 1 year ago • 2 comments

```
  File "D:\ai_zeng\ChatGLM-Efficient-Tuning\src\train_sft.py", line 28, in main
    model_args, data_args, training_args, finetuning_args = prepare_args(stage="sft")
  File "D:\ai_zeng\ChatGLM-Efficient-Tuning\src\utils\common.py", line 295, in prepare_args
    model_args, data_args, training_args, finetuning_args = parser.parse_args_into_dataclasses()
  File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\transformers\hf_argparser.py", line 346, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 116, in __init__
  File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\transformers\training_args.py", line 1340, in __post_init__
    and (self.device.type != "cuda")
  File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\transformers\training_args.py", line 1764, in device
    return self._setup_devices
  File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\transformers\utils\generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\transformers\training_args.py", line 1695, in _setup_devices
    self.distributed_state = PartialState(backend=self.ddp_backend)
  File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\accelerate\state.py", line 197, in __init__
    torch.cuda.set_device(self.device)
  File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\torch\cuda\__init__.py", line 350, in set_device
    torch._C._cuda_setDevice(device)
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
```

```
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 26208) of binary: D:\ProgramData\anaconda3\envs\py10\python.exe
Traceback (most recent call last):
  File "D:\ProgramData\anaconda3\envs\py10\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\ProgramData\anaconda3\envs\py10\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\ProgramData\anaconda3\envs\py10\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\accelerate\commands\launch.py", line 928, in launch_command
    multi_gpu_launcher(args)
  File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\accelerate\commands\launch.py", line 627, in multi_gpu_launcher
    distrib_run.run(args)
  File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\ProgramData\anaconda3\envs\py10\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

```
src/train_sft.py FAILED

Failures:
  [1]:
    time       : 2023-07-03_23:45:01
    host       : XTZJ-20210729YS
    rank       : 1 (local_rank: 1)
    exitcode   : 1 (pid: 26228)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
  [0]:
    time       : 2023-07-03_23:45:01
    host       : XTZJ-20210729YS
    rank       : 0 (local_rank: 0)
    exitcode   : 1 (pid: 26208)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
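This `AttributeError` usually points at a PyTorch wheel built without CUDA support: `accelerate`'s `PartialState` ends up calling `torch.cuda.set_device()`, and on a CPU-only build the `torch._C._cuda_setDevice` symbol does not exist. A quick environment check (a diagnostic sketch, not part of this repo) can confirm that, and also whether the NCCL backend that multi-GPU launches default to is available at all — it never is on Windows; gloo is the only option there:

```python
# Diagnostic sketch: checks whether this PyTorch build can drive a
# multi-GPU launch. A CPU-only wheel reports no CUDA support, and on
# Windows the NCCL backend is never available (gloo is the fallback).
import torch
import torch.distributed as dist

print("torch version      :", torch.__version__)
print("compiled with CUDA :", torch.version.cuda is not None)
print("CUDA available     :", torch.cuda.is_available())
print("NCCL available     :", dist.is_nccl_available())
print("gloo available     :", dist.is_gloo_available())
```

If `CUDA available` prints `False`, reinstalling a CUDA-enabled torch wheel is the first thing to try before touching the launch script.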

ArtificialZeng avatar Jul 03 '23 16:07 ArtificialZeng

```
(chatglm_etuning) PS D:\ai_zeng\ChatGLM-Efficient-Tuning> accelerate launch src/train_sft.py --config_file accelerate_config.yaml --do_train --dataset book_train_3 --finetuning_type lora --output_dir path_to_sft_checkpoint_5e_5 --per_device_train_batch_size 8 --gradient_accumulation_steps 2 --lr_scheduler_type cosine --logging_steps 10 --save_steps 1000 --learning_rate 5e-5 --num_train_epochs 100.0 --fp16 --ddp_find_unused_parameters False
NOTE: Redirects are currently not supported in Windows or MacOs.
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [XTZJ-20210729YS]:29500 (system error: 10049 - The requested address is not valid in its context.)
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [XTZJ-20210729YS]:29500 (system error: 10049 - The requested address is not valid in its context.)
Traceback (most recent call last):
  File "D:\ai_zeng\ChatGLM-Efficient-Tuning\src\train_sft.py", line 6, in <module>
    from utils import (
  File "D:\ai_zeng\ChatGLM-Efficient-Tuning\src\utils\__init__.py", line 1, in <module>
    from .common import (
  File "D:\ai_zeng\ChatGLM-Efficient-Tuning\src\utils\common.py", line 25, in <module>
    from peft import (
  File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\peft\__init__.py", line 22, in <module>
    from .mapping import MODEL_TYPE_TO_PEFT_MODEL_MAPPING, PEFT_TYPE_TO_CONFIG_MAPPING, get_peft_config, get_peft_model
  File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\peft\mapping.py", line 16, in <module>
    from .peft_model import (
  File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\peft\peft_model.py", line 31, in <module>
    from .tuners import (
  File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\peft\tuners\__init__.py", line 21, in <module>
    from .lora import LoraConfig, LoraModel
  File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\peft\tuners\lora.py", line 40, in <module>
    import bitsandbytes as bnb
  File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\bitsandbytes\__init__.py", line 5, in <module>
    from .optim import adam
  File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\bitsandbytes\optim\__init__.py", line 5, in <module>
    from .adam import Adam, Adam8bit, Adam32bit
  File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\bitsandbytes\optim\adam.py", line 11, in <module>
    from bitsandbytes.optim.optimizer import Optimizer2State
  File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\bitsandbytes\optim\optimizer.py", line 6, in <module>
    import bitsandbytes.functional as F
  File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\bitsandbytes\functional.py", line 13, in <module>
    lib = ct.cdll.LoadLibrary(os.path.dirname(__file__) + '/libbitsandbytes.so')
  File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\ctypes\__init__.py", line 452, in LoadLibrary
    return self._dlltype(name)
  File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\ctypes\__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: [WinError 193] %1 is not a valid Win32 application.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2104) of binary: D:\ProgramData\anaconda3\envs\chatglm_etuning\python.exe
Traceback (most recent call last):
  File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\ProgramData\anaconda3\envs\chatglm_etuning\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\accelerate\commands\launch.py", line 928, in launch_command
    multi_gpu_launcher(args)
  File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\accelerate\commands\launch.py", line 627, in multi_gpu_launcher
    distrib_run.run(args)
  File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\ProgramData\anaconda3\envs\chatglm_etuning\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

```
src/train_sft.py FAILED

Failures:
  [1]:
    time       : 2023-07-03_23:54:24
    host       : XTZJ-20210729YS
    exitcode   : 1 (pid: 4312)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
  [0]:
    time       : 2023-07-03_23:54:24
    host       : XTZJ-20210729YS
    rank       : 0 (local_rank: 0)
    exitcode   : 1 (pid: 2104)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
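This second failure is independent of the first one: the `bitsandbytes` version in this environment hard-codes loading `libbitsandbytes.so`, a Linux ELF shared object, and `ctypes` on Windows can only load native PE/DLL binaries — hence `OSError: [WinError 193] %1 is not a valid Win32 application`. The failure class is easy to reproduce with the standard library alone (a sketch unrelated to the repo's code; the fake file stands in for a foreign-format binary):

```python
# Sketch: ctypes refuses to load a file that is not a valid shared
# library for the host OS -- the same class of failure as dlopen'ing
# the Linux-only libbitsandbytes.so on Windows (WinError 193).
import ctypes
import os
import tempfile

# Stand-in for a foreign-format binary: not a valid library on any OS.
with tempfile.NamedTemporaryFile(suffix=".so", delete=False) as f:
    f.write(b"\x7fNOT-A-REAL-SHARED-LIBRARY")
    fake_lib = f.name

try:
    ctypes.CDLL(fake_lib)
    loaded = True
except OSError as err:  # WinError 193 on Windows, "invalid ELF header" on Linux
    print("load failed:", err)
    loaded = False
finally:
    os.unlink(fake_lib)

print("loaded:", loaded)
```

In other words, even with the torch/CUDA issue fixed, this code path cannot work on Windows until `bitsandbytes` is replaced with a Windows-compatible build or the LoRA path avoids importing it.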

ArtificialZeng avatar Jul 03 '23 16:07 ArtificialZeng

The script above is from the project README:

```
accelerate launch src/train_sft.py --config_file accelerate_config.yaml --do_train --dataset book_train_3 --finetuning_type lora --output_dir path_to_sft_checkpoint_5e_5 --per_device_train_batch_size 8 --gradient_accumulation_steps 2 --lr_scheduler_type cosine --logging_steps 10 --save_steps 1000 --learning_rate 5e-5 --num_train_epochs 100.0 --fp16 --ddp_find_unused_parameters False
```

It simply does not run.
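If distributed training on Windows is to be attempted at all, the process group has to use the gloo backend (for example via `--ddp_backend gloo`, a standard `transformers` `TrainingArguments` flag), since PyTorch ships no NCCL build for Windows. A minimal single-process sketch of a gloo-backed process group (the address and port below are arbitrary assumptions, not from the repo):

```python
# Minimal sketch: initialize a one-process group on the gloo backend,
# the only torch.distributed backend available on Windows.
import os

import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29531")  # arbitrary free port

dist.init_process_group(backend="gloo", rank=0, world_size=1)
print("backend:", dist.get_backend())  # gloo
dist.destroy_process_group()
```

Whether the rest of the training pipeline (fp16, LoRA via bitsandbytes) then works on Windows is a separate question, per the import error above.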

ArtificialZeng avatar Jul 03 '23 16:07 ArtificialZeng

Has this issue been resolved?

KissMyLady avatar Aug 02 '23 07:08 KissMyLady