
SFT training gets interrupted: is there a problem with the sft_512.jsonl file?

Open powermano opened this issue 5 days ago • 6 comments

Every time I train with sft_512.jsonl, training aborts at the same fixed point. Training with sft_1024.jsonl and sft_2048.jsonl works normally.
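One way to test the file itself, independently of training, is to scan it line by line and flag records that are empty, fail to parse as JSON, or are unusually long. The following is a minimal sketch, not part of the minimind code base: the function name `scan_jsonl` and the `max_chars` threshold are assumptions made for illustration, and it only assumes the file holds one JSON object per line.

```python
import json


def scan_jsonl(path, max_chars=100_000):
    """Return (line_number, problem) pairs for suspicious records.

    Hypothetical helper: `max_chars` is an arbitrary threshold for
    flagging records that are much longer than expected.
    """
    problems = []
    # errors="replace" keeps the scan going even if a line has bad bytes.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line_number, line in enumerate(f, start=1):
            stripped = line.strip()
            if not stripped:
                problems.append((line_number, "empty line"))
                continue
            if len(stripped) > max_chars:
                problems.append(
                    (line_number, f"very long record ({len(stripped)} chars)")
                )
            try:
                json.loads(stripped)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
    return problems
```

Calling `scan_jsonl("sft_512.jsonl")` and printing the result shows whether any record stands out. An empty list does not rule out the file as the cause, since it only covers these three checks.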

Epoch:[1/1](9300/70838) loss:1.703 lr:0.000052903582 0.244s/iters epoch_Time:251.0min:
[2025-02-21 16:44:34,662] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 52827 closing signal SIGTERM
[2025-02-21 16:44:34,663] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 52828 closing signal SIGTERM
[2025-02-21 16:44:36,532] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 52826) of binary: /opt/anaconda3/envs/pytoch1.13_cuda11.7/bin/python
Traceback (most recent call last):
  File "/opt/anaconda3/envs/pytoch1.13_cuda11.7/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.0', 'console_scripts', 'torchrun')())
  File "/opt/anaconda3/envs/pytoch1.13_cuda11.7/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/opt/anaconda3/envs/pytoch1.13_cuda11.7/lib/python3.8/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/opt/anaconda3/envs/pytoch1.13_cuda11.7/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/opt/anaconda3/envs/pytoch1.13_cuda11.7/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/anaconda3/envs/pytoch1.13_cuda11.7/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedErro
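
A note on reading the log above: `exitcode: -9` follows the Python convention that a negative exit code means the process was terminated by that signal number, so rank 0 was killed by signal 9 (SIGKILL) rather than failing with a Python exception. On Linux, one common source of SIGKILL is the kernel's out-of-memory killer, though nothing in this log confirms that is what happened here; the kernel log (for example via `dmesg`) would show it. The sketch below decodes such exit codes; `describe_exit_code` is a hypothetical helper name and it assumes a POSIX system.

```python
import signal


def describe_exit_code(exit_code):
    """Turn a subprocess-style exit code into a readable description."""
    if exit_code < 0:
        # Negative values mean "terminated by signal -exit_code".
        try:
            sig = signal.Signals(-exit_code)
        except ValueError:
            return f"killed by unknown signal {-exit_code}"
        return f"killed by signal {sig.name} ({sig.value})"
    return f"exited with status {exit_code}"


print(describe_exit_code(-9))
```

Because SIGKILL cannot be caught, the worker has no chance to print a traceback of its own, which is why the only traceback in the log comes from the `torchrun` launcher.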

powermano avatar Feb 21 '25 15:02 powermano