Chinese-Vicuna
bash finetune_continue.sh failed with 'RuntimeError: Trainer requires either a model or model_init argument'
Found cached dataset json (/home/ub2004/.cache/huggingface/datasets/json/default-6eef2a44d8479e8f/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 115.39it/s]
Restarting from ./lora-Vicuna/checkpoint-11600/pytorch_model.bin
finetune.py:125: UserWarning: epoch 3 replace to the base_max_steps 17298
warnings.warn("epoch {} replace to the base_max_steps {}".format(EPOCHS, base_max_steps))
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/ub2004/llm_dev/Chinese-Vicuna/finetune.py:235 in &lt;module&gt;
│
│   232 │   train_data = data["train"].shuffle().map(generate_and_tokenize_prompt)
│   233 │   val_data = None
│   234 │
│ ❱ 235 trainer = transformers.Trainer(
│   236 │   model=model,
│   237 │   train_dataset=train_data,
│   238 │   eval_dataset=val_data,
│
│ /home/ub2004/.local/lib/python3.8/site-packages/transformers/trainer.py:356 in __init__
│
│   353 │   │   │   │   self.model_init = model_init
│   354 │   │   │   │   model = self.call_model_init()
│   355 │   │   │   else:
│ ❱ 356 │   │   │   │   raise RuntimeError("`Trainer` requires either a `model` or `model_init` argument")
│   357 │   │   else:
│   358 │   │   │   if model_init is not None:
│   359 │   │   │   │   warnings.warn(
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: `Trainer` requires either a `model` or `model_init` argument
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 26013) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/ub2004/.local/bin/torchrun", line 8, in &lt;module&gt;
    sys.exit(main())
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-29_00:06:42
  host      : ub2004-B85M-A0
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 26013)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
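
For context, the `RuntimeError` in the log is raised by `transformers.Trainer.__init__` itself whenever both `model` and `model_init` are missing; in `finetune.py` that usually means `model` ended up as `None` because an earlier load/resume step silently failed. A minimal sketch of that guard (paraphrased for illustration, not the library's exact source; `build_trainer_model` is a name made up here):

```python
def build_trainer_model(model=None, model_init=None):
    """Sketch of the argument check Trainer performs on construction."""
    if model is None:
        if model_init is None:
            # This is the exact failure seen in the log above.
            raise RuntimeError(
                "`Trainer` requires either a `model` or `model_init` argument"
            )
        # Trainer would call model_init() to build the model lazily.
        model = model_init()
    return model


# A failed checkpoint load that leaves `model` unset reproduces the crash:
try:
    build_trainer_model(model=None, model_init=None)
except RuntimeError as e:
    print(e)
```

So the traceback points at the `trainer = transformers.Trainer(...)` call, but the real question is why `model` was `None` by the time line 235 ran.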
This error looks like the model was not successfully loaded. You can try the third snippet in the code here, under the title "输出乱码问题" (garbled-output issue); you can use that code to check whether you can load the model properly.
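
One quick sanity check before involving `Trainer` at all is whether the checkpoint file itself deserializes on CPU. The helper below is only a sketch (`can_load_checkpoint` is a name invented here; the demo uses a throwaway file, and the real path is the one from the log):

```python
import os
import tempfile

import torch


def can_load_checkpoint(path):
    """Return True if torch.load reads the file into a non-empty state dict on CPU."""
    try:
        state = torch.load(path, map_location="cpu")
    except Exception:
        return False
    return isinstance(state, dict) and len(state) > 0


# Demo with a throwaway checkpoint; for the real run, point `path` at
# ./lora-Vicuna/checkpoint-11600/pytorch_model.bin instead.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "pytorch_model.bin")
    torch.save({"weight": torch.zeros(2, 2)}, path)
    print(can_load_checkpoint(path))               # intact file
    print(can_load_checkpoint(path + ".missing"))  # nonexistent file
```

If the real `pytorch_model.bin` fails this check (truncated download, interrupted save), resuming will leave the model unset and trigger the `Trainer` error above.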
Did you manage to solve this problem?