
After the weights are saved, the program stops inexplicably without printing the actual error. The failure timing is irregular; sometimes several checkpoints are saved successfully before this starts happening.

Open · hnjzbss opened this issue 7 months ago • 1 comment

Reminder

  • [x] I have read the above rules and searched the existing issues.

System Info

```yaml
### model
model_name_or_path: /home/models/pretrained/Qwen3-8B-Base
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_offload_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json, ds_z3_offload_config.json]

### dataset
dataset: general_data
mask_history: False
template: qwen
cutoff_len: 32000
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /home/master/work/
logging_steps: 10
save_strategy: steps
save_steps: 200
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 32
learning_rate: 2.0e-5
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.01
bf16: true
flash_attn: fa2
ddp_timeout: 180000000
```
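
For reference, the fields above form a single LLaMA-Factory training YAML. A minimal launch sketch, assuming the block is saved as `qwen3_full_sft.yaml` (the file name is hypothetical); `llamafactory-cli train <config>` is the documented entry point:

```python
# Sketch: launch the SFT run from Python instead of the shell.
# Assumes the YAML above is saved as "qwen3_full_sft.yaml" (hypothetical name)
# and that `llamafactory-cli` is on PATH.
import subprocess

subprocess.run(
    ["llamafactory-cli", "train", "qwen3_full_sft.yaml"],
    check=True,  # raise CalledProcessError if the launcher exits non-zero (e.g. the -7 failure below)
)
```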

Reproduction

eckpoint-200/global_step200/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2025-05-14 14:25:51,391] [INFO] [engine.py:3536:_save_zero_checkpoint] zero checkpoint saved /home/master/work/LLM-Train/LLaMA-Factory-Y/saves/Qwen3/full/v15_align_0228_qwen3_8B_2e5_qtemp/checkpoint-200/global_step200/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-05-14 14:25:52,929] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step200 is ready now!
  4%|▍         | 209/4694 [2:21:45<55:42:59, 44.72s/it]
W0514 14:33:19.493000 261 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 326 closing signal SIGTERM
W0514 14:33:19.524000 261 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 327 closing signal SIGTERM
W0514 14:33:19.527000 261 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 329 closing signal SIGTERM
W0514 14:33:19.528000 261 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 330 closing signal SIGTERM
W0514 14:33:19.531000 261 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 331 closing signal SIGTERM
W0514 14:33:19.533000 261 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 332 closing signal SIGTERM
W0514 14:33:19.535000 261 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 333 closing signal SIGTERM
E0514 14:33:36.041000 261 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -7) local_rank: 2 (pid: 328) of binary: /home/master/.python_libs/conda_env/bs0515/bin/python3.12
Traceback (most recent call last):
  File "/home/master/.python_libs/conda_env/bs0515/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/master/.python_libs/conda_env/bs0515/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/master/.python_libs/conda_env/bs0515/lib/python3.12/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/home/master/.python_libs/conda_env/bs0515/lib/python3.12/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/home/master/.python_libs/conda_env/bs0515/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/master/.python_libs/conda_env/bs0515/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/master/work/LLM-Train/LLaMA-Factory-0515/src/llamafactory/launcher.py FAILED

Others

No response

hnjzbss avatar May 14 '25 06:05 hnjzbss

failed (exitcode: -7): check whether you are running out of CPU memory.

Kuangdd01 avatar May 14 '25 13:05 Kuangdd01
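
On the suggestion above: a negative exit code means the worker died from a signal (here signal 7, SIGBUS on Linux) rather than raising a Python exception, which is why no traceback from the training code appears; exhausted host RAM or shared memory is a common cause. A minimal monitoring sketch to correlate RAM drops with checkpoint saves, assuming `psutil` is installed; the polling interval and threshold are arbitrary choices:

```python
# Sketch: poll free host RAM while training runs, to correlate drops with checkpoint saves.
# Assumes psutil is installed; run in a separate shell alongside training.
import time

import psutil

LOW_WATERMARK_GB = 10  # warn when available RAM falls below this (arbitrary threshold)

while True:
    vm = psutil.virtual_memory()
    avail_gb = vm.available / 1024**3
    print(f"{time.strftime('%H:%M:%S')} available={avail_gb:.1f} GiB used={vm.percent:.0f}%")
    if avail_gb < LOW_WATERMARK_GB:
        print("WARNING: host RAM is nearly exhausted; the next checkpoint save may be killed")
    time.sleep(30)
```

Checking the kernel log (`dmesg` or `journalctl -k`) right after a crash for oom-killer or memory-related messages is another quick way to confirm or rule out this hypothesis.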

> failed (exitcode: -7): check whether you are running out of CPU memory.

I'm running into this too. I suspect a CPU memory leak during checkpoint saving: memory usage grows with every save, and I don't yet know how to track it down.

jiajunly avatar Aug 13 '25 03:08 jiajunly
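
One way to test the leak hypothesis above is to log the trainer process's resident memory at every checkpoint. A minimal sketch using a `transformers` `TrainerCallback` plus `psutil` (LLaMA-Factory builds on the HF Trainer, but wiring a custom callback in may require a small patch to the training entry point; `MemoryLogCallback` is a hypothetical name):

```python
# Sketch: log per-rank resident memory (RSS) right after each checkpoint save.
# Assumes transformers and psutil are installed; MemoryLogCallback is a hypothetical name.
import os

import psutil
from transformers import TrainerCallback


class MemoryLogCallback(TrainerCallback):
    def on_save(self, args, state, control, **kwargs):
        rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
        print(f"[rank {os.environ.get('LOCAL_RANK', '?')}] "
              f"step {state.global_step}: RSS = {rss_gb:.1f} GiB after save")
        return control
```

If RSS climbs by a similar amount after every save and never comes back down, that points at the checkpointing path (as in the DeepSpeed issue linked in the next comment) rather than at the training step itself.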

https://github.com/deepspeedai/DeepSpeed/issues/3582 This looks like a DeepSpeed ZeRO-3 issue.

jiajunly avatar Aug 13 '25 03:08 jiajunly

Same problem here, deepspeed==0.16.9. With the same settings, a machine with more RAM runs fine while one with less RAM gets killed.

amoyplane avatar Aug 22 '25 02:08 amoyplane
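
The "big machine fine, small machine dies" pattern is consistent with the baseline cost of ZeRO-3 optimizer offload: with an Adam-style optimizer, the fp32 master weights and both moment tensors sit in host RAM. A rough back-of-the-envelope sketch (8B parameters comes from the model in the config above; 12 bytes per parameter is an approximation, and actual usage is higher while a checkpoint is being written):

```python
# Sketch: rough host-RAM floor for ZeRO-3 with optimizer offload (Adam-style states).
# Assumptions: single node, 8e9 parameters, ~12 bytes/param of fp32 optimizer state
# (fp32 master weights + momentum + variance) offloaded to CPU.
params = 8e9                    # Qwen3-8B-Base, from the config above
bytes_per_param_offloaded = 12  # 4 (fp32 master) + 4 (exp_avg) + 4 (exp_avg_sq)

optimizer_states_gib = params * bytes_per_param_offloaded / 1024**3
print(f"~{optimizer_states_gib:.0f} GiB of host RAM just for offloaded optimizer states")
```

That is roughly 90 GiB before counting dataloader workers, pinned buffers, or the temporary buffers used while writing the `global_stepN/*_optim_states.pt` files, so a host only slightly above this floor can survive training steps yet get killed during a save.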

same problem...

alex-wll avatar Sep 12 '25 03:09 alex-wll