
Cannot train Qwen2.5 on a single node with multiple GPUs

Open 1571859588 opened this issue 1 year ago • 3 comments

Command:

CUDA_VISIBLE_DEVICES=1,2,3,4 NPROC_PER_NODE=4 xtuner train ./internlm2_chat_1_8b_dpo_full_copy.py

internlm2_chat_1_8b_dpo_full_copy.py

Starting from the example config, I changed the dataset and the model to Qwen2.5-32B.
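
For reference, the changes were roughly of the following form. This is only a minimal sketch assuming the standard layout of xtuner configs; the paths below are placeholders rather than the ones actually used.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of the edits in internlm2_chat_1_8b_dpo_full_copy.py; paths are placeholders,
# and the surrounding dict(...) structure of the example config is left unchanged.
pretrained_model_name_or_path = '/path/to/Qwen2.5-32B-Instruct'  # was the InternLM2-1.8B path

tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')

# The llm=dict(type=AutoModelForCausalLM.from_pretrained, ...) entry inside the model
# definition is pointed at the same path, and the dataset entry is swapped to my own
# preference data in the same way.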

Partial output of the run:


W1209 06:53:34.745000 139752507041600 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 34366 closing signal SIGTERM
W1209 06:53:34.747000 139752507041600 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 34367 closing signal SIGTERM
W1209 06:53:34.747000 139752507041600 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 34368 closing signal SIGTERM
E1209 06:53:46.195000 139752507041600 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 3 (pid: 34369) of binary: /mnt/public/conda/envs/xtuner/bin/python
Traceback (most recent call last):
  File "/mnt/public/conda/envs/xtuner/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/mnt/public/conda/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-09_06:53:34
  host      : is-dahisl6olik7jjio-devmachine-0
  rank      : 3 (local_rank: 3)
  exitcode  : -9 (pid: 34369)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 34369
============================================================

Error-related part of the output log file:

2024/12/09 02:59:17 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.

What I have tried:

  1. Changed max_length to 1024 and then 512 (see the sketch after this list), but it still fails. I am not sure whether this is an OOM issue, but I hit the same problem even when running on 6 A100 80G GPUs.
  2. Ran on a single GPU: it fails directly with an out-of-memory error, without raising the torch.distributed.elastic.multiprocessing.errors.ChildFailedError shown above (which makes sense, since with only one GPU there is no distributed launch involved).
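
The max_length change from item 1 is just the assignment near the top of the config, roughly as follows (a sketch; in the example file the value sits alongside the other hyperparameters):

max_length = 1024  # also tried 512; neither value avoided the crash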

I now suspect that XTuner may not support Qwen2.5. Looking through the repository's other issues, none of them seem to describe a problem like mine, and an earlier question about whether Qwen is supported got no reply... Any help would be much appreciated!

1571859588 · Dec 09 '24 07:12

llama-3.1-8B-instruct has the same problem.

siyuyuan · Dec 15 '24 07:12

Qwen2.5 is supported.

Diyigelieren · Jan 15 '25 03:01

llama-3.1-8B-instruct has the same problem.

Fine-tuning llama3-8b locally, I see the same thing.

zxjhellow2 · Jun 30 '25 06:06