Error when training a model
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
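Following the video tutorial, I tried to train a model. A short while after clicking "Start", training failed with the error below. What is causing this?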
E0512 17:04:53.622000 124936 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 6 (pid: 125052) of binary: /usr/local/python3/bin/python3.10
Traceback (most recent call last):
File "/usr/local/python3/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/python3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/usr/local/python3/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/usr/local/python3/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/usr/local/python3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/python3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/root/LLaMA-Factory/src/llamafactory/launcher.py FAILED
Failures: <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time      : 2025-05-12_17:04:53
host      : localhost.localdomain
rank      : 6 (local_rank: 6)
exitcode  : 1 (pid: 125052)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Traceback (most recent call last):
File "/usr/local/python3/bin/llamafactory-cli", line 8, in <module>
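The summary above only says that rank 6 exited with code 1; the exception itself is not included (`error_file` is `<N/A>`). A minimal sketch of pulling the fields out of such a summary is shown here, assuming the summary is printed one `key : value` field per line as torchrun normally does. The function names `parse_root_cause` and `child_traceback_missing` and the embedded `SUMMARY` sample are illustrative only, not part of PyTorch or LLaMA-Factory.

```python
import re

# Sample text copied from the log in this report (one field per line).
SUMMARY = """\
Root Cause (first observed failure):
[0]:
time      : 2025-05-12_17:04:53
host      : localhost.localdomain
rank      : 6 (local_rank: 6)
exitcode  : 1 (pid: 125052)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
"""


def parse_root_cause(text: str) -> dict[str, str]:
    """Collect the `key : value` fields that follow the 'Root Cause' header."""
    _, _, tail = text.partition("Root Cause")
    fields: dict[str, str] = {}
    for line in tail.splitlines():
        # A field line starts with a bare word followed by a colon.
        match = re.match(r"\s*(\w+)\s*:\s*(.*)$", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def child_traceback_missing(fields: dict[str, str]) -> bool:
    """True when the launcher recorded no error file for the failed worker."""
    return fields.get("error_file") == "<N/A>"


if __name__ == "__main__":
    info = parse_root_cause(SUMMARY)
    print(info["rank"], "|", info["exitcode"])
    if child_traceback_missing(info):
        print("No child traceback captured; check the worker output earlier in the log.")
```

When the error file is missing, the actual exception from the failing worker usually appears earlier in the console output, before the `ChildFailedError` summary. The linked PyTorch page also describes decorating the training entrypoint with `record` from `torch.distributed.elastic.multiprocessing.errors` so that the worker's traceback is written into this summary.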
Reproduction
Put your message here.
Others
No response