Single-machine multi-GPU training fails for Qwen2.5
Command used:
CUDA_VISIBLE_DEVICES=1,2,3,4 NPROC_PER_NODE=4 xtuner train ./internlm2_chat_1_8b_dpo_full_copy.py
internlm2_chat_1_8b_dpo_full_copy.py
Starting from the example config, I swapped the dataset and the model for Qwen2.5-32B.
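Roughly, the changed fields look like this (a minimal sketch assuming the stock XTuner DPO example layout; the model ID 'Qwen/Qwen2.5-32B-Instruct' stands in for the actual local checkpoint path):

```python
# Sketch of the modified fields in internlm2_chat_1_8b_dpo_full_copy.py.
# Assumption: the config follows the stock XTuner DPO example; the model ID
# below is illustrative, not the exact path used.
from transformers import AutoModelForCausalLM, AutoTokenizer

pretrained_model_name_or_path = 'Qwen/Qwen2.5-32B-Instruct'

tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')

# The llm entry inside the model dict points at the same checkpoint.
llm = dict(
    type=AutoModelForCausalLM.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True)
```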
Partial output from the run:
W1209 06:53:34.745000 139752507041600 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 34366 closing signal SIGTERM
W1209 06:53:34.747000 139752507041600 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 34367 closing signal SIGTERM
W1209 06:53:34.747000 139752507041600 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 34368 closing signal SIGTERM
E1209 06:53:46.195000 139752507041600 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 3 (pid: 34369) of binary: /mnt/public/conda/envs/xtuner/bin/python
Traceback (most recent call last):
  File "/mnt/public/conda/envs/xtuner/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-09_06:53:34
  host      : is-dahisl6olik7jjio-devmachine-0
  rank      : 3 (local_rank: 3)
  exitcode  : -9 (pid: 34369)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 34369
============================================================
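For what it's worth, exitcode -9 means the worker was killed by SIGKILL from outside Python, which on Linux usually points at the kernel OOM killer, i.e. host RAM rather than GPU memory ran out. Without a sharding backend, each torchrun rank typically materializes its own full copy of the model on CPU before moving it to its GPU, so a 32B model in bf16 across 4 ranks needs roughly 240 GiB of host RAM just to load. A back-of-the-envelope check (the 32B/bf16 numbers and the use of psutil are my assumptions, not taken from the log):

```python
# Rough check: SIGKILL (exitcode -9) with no Python traceback in the worker
# usually means the kernel OOM killer reaped the process, i.e. host RAM was
# exhausted. Host-RAM demand at load time scales with NPROC_PER_NODE because
# each rank builds its own copy of the model on CPU first.
import psutil

params = 32e9            # assumed: Qwen2.5-32B parameter count
bytes_per_param = 2      # assumed: bf16 weights
ranks = 4                # NPROC_PER_NODE=4 from the launch command

need_gib = params * bytes_per_param * ranks / 1024**3
have_gib = psutil.virtual_memory().available / 1024**3
print(f'rough host-RAM need: {need_gib:.0f} GiB, available: {have_gib:.0f} GiB')
```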
The failing part of the output log:
2024/12/09 02:59:17 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.
What I have tried:
- Reduced max_length to 1024 and then to 512; the failure is unchanged. I am not sure whether this is OOM, but I hit the exact same error with 6 A100 80G cards (see the mitigation sketch after this list).
- Ran on a single GPU: it fails immediately with an out-of-memory error and never raises the torch.distributed.elastic.multiprocessing.errors.ChildFailedError shown above (which makes sense, since without multiple GPUs there is no distributed launch to fail).
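In case the load-time host-RAM spike is the culprit, one mitigation sketch is having transformers stream the checkpoint shard by shard instead of materializing a randomly initialized model first. low_cpu_mem_usage is a standard from_pretrained flag; whether it is enough for a 32B model on this machine is untested:

```python
# Hedged mitigation sketch: reduce per-rank CPU RAM at checkpoint load time.
# low_cpu_mem_usage=True loads weights shard by shard instead of first
# allocating a full randomly initialized model; torch_dtype=torch.bfloat16
# halves the in-memory size vs. fp32. (Untested against this exact config.)
import torch
from transformers import AutoModelForCausalLM

llm = dict(
    type=AutoModelForCausalLM.from_pretrained,
    pretrained_model_name_or_path='Qwen/Qwen2.5-32B-Instruct',  # illustrative ID
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True)
```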
Now I am wondering whether XTuner simply does not support Qwen2.5. I looked through the other issues in this repo and none seem similar to mine, and an earlier question about Qwen support never got a reply... Any help would be much appreciated!
llama-3.1-8B-instruct has the same problem.
Qwen2.5 is supported.
Fine-tuning llama3-8b locally, I see the same thing.