
bash finetune_continue.sh failed with 'RuntimeError: Trainer requires either a model or model_init argument'

SeekPoint opened this issue 1 year ago • 2 comments

Found cached dataset json (/home/ub2004/.cache/huggingface/datasets/json/default-6eef2a44d8479e8f/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|████████| 1/1 [00:00<00:00, 115.39it/s]
Restarting from ./lora-Vicuna/checkpoint-11600/pytorch_model.bin
finetune.py:125: UserWarning: epoch 3 replace to the base_max_steps 17298
  warnings.warn("epoch {} replace to the base_max_steps {}".format(EPOCHS, base_max_steps))
Traceback (most recent call last):
  /home/ub2004/llm_dev/Chinese-Vicuna/finetune.py:235 in <module>
      232   train_data = data["train"].shuffle().map(generate_and_tokenize_prompt)
      233   val_data = None
      234
  ❱   235   trainer = transformers.Trainer(
      236       model=model,
      237       train_dataset=train_data,
      238       eval_dataset=val_data,
  /home/ub2004/.local/lib/python3.8/site-packages/transformers/trainer.py:356 in __init__
      353           self.model_init = model_init
      354           model = self.call_model_init()
      355       else:
  ❱   356           raise RuntimeError("Trainer requires either a model or model_init
      357   else:
      358       if model_init is not None:
      359           warnings.warn(
RuntimeError: Trainer requires either a model or model_init argument
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 26013) of binary: /usr/bin/python3

(gh_Chinese-Vicuna) ub2004@ub2004-B85M-A0:~/llm_dev/Chinese-Vicuna$ bash finetune_continue.sh

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to:
https://github.com/TimDettmers/bitsandbytes/issues

Found cached dataset json (/home/ub2004/.cache/huggingface/datasets/json/default-6eef2a44d8479e8f/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|████████| 1/1 [00:00<00:00, 115.39it/s]
Restarting from ./lora-Vicuna/checkpoint-11600/pytorch_model.bin
finetune.py:125: UserWarning: epoch 3 replace to the base_max_steps 17298
  warnings.warn("epoch {} replace to the base_max_steps {}".format(EPOCHS, base_max_steps))
Traceback (most recent call last):
  /home/ub2004/llm_dev/Chinese-Vicuna/finetune.py:235 in <module>
      232   train_data = data["train"].shuffle().map(generate_and_tokenize_prompt)
      233   val_data = None
      234
  ❱   235   trainer = transformers.Trainer(
      236       model=model,
      237       train_dataset=train_data,
      238       eval_dataset=val_data,
  /home/ub2004/.local/lib/python3.8/site-packages/transformers/trainer.py:356 in __init__
      353           self.model_init = model_init
      354           model = self.call_model_init()
      355       else:
  ❱   356           raise RuntimeError("Trainer requires either a model or model_init
      357   else:
      358       if model_init is not None:
      359           warnings.warn(
RuntimeError: Trainer requires either a model or model_init argument
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 26013) of binary: /usr/bin/python3
Traceback (most recent call last):

  File "/home/ub2004/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ub2004/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

finetune.py FAILED

Failures: <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time       : 2023-04-29_00:06:42
  host       : ub2004-B85M-A0
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 26013)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
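For context, the guard in `transformers/trainer.py` that raises this error can be sketched in plain Python (this is a simplified illustration, not the actual library source): `Trainer` only raises when *both* `model` and `model_init` are missing, so the error means `model` was `None` by the time line 235 of finetune.py ran.

```python
def resolve_model(model=None, model_init=None):
    """Simplified sketch of Trainer.__init__'s model resolution logic."""
    if model is None:
        if model_init is not None:
            # Trainer can build the model lazily from a factory callable.
            model = model_init()
        else:
            # This is the branch hit in the log above.
            raise RuntimeError(
                "Trainer requires either a model or a model_init argument"
            )
    return model
```

In other words, the traceback points at `transformers`, but the real problem is upstream: something in the resume path left `model` unset before the `Trainer` was constructed.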

SeekPoint avatar Apr 28 '23 16:04 SeekPoint

This error looks like the model was not loaded successfully. You can try the third item in the code linked here, titled "输出乱码问题" (garbled output problem). You can use that code to check whether you can load the model properly.
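Before re-running, it may also be worth sanity-checking the checkpoint file you are resuming from. This is a hypothetical helper (not part of the repo): a missing or zero-byte `pytorch_model.bin`, e.g. from an interrupted save, can make the resume path fail and leave `model` unset, which would surface as exactly this `Trainer` error.

```python
import os

def checkpoint_looks_valid(path):
    """Quick sanity check: the checkpoint file exists and is non-empty.
    A zero-byte pytorch_model.bin usually means an interrupted save."""
    return os.path.isfile(path) and os.path.getsize(path) > 0

# e.g. checkpoint_looks_valid("./lora-Vicuna/checkpoint-11600/pytorch_model.bin")
```

If this returns False, deleting the broken checkpoint directory and resuming from the previous one is a reasonable next step.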

Facico avatar May 04 '23 03:05 Facico


Did you manage to solve this problem?

YSLLYW avatar May 20 '23 03:05 YSLLYW