After a checkpoint is saved, the program inexplicably stops without printing the real error. The timing is also irregular: sometimes several checkpoints are saved successfully before this failure starts to appear.
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
model
model_name_or_path: /home/models/pretrained/Qwen3-8B-Base
trust_remote_code: true
method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_offload_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json, ds_z3_offload_config.json]
dataset
dataset: general_data
mask_history: False
template: qwen
cutoff_len: 32000
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
output
output_dir: /home/master/work/
logging_steps: 10
save_strategy: steps
save_steps: 200
plot_loss: true
overwrite_output_dir: true
train
per_device_train_batch_size: 1
gradient_accumulation_steps: 32
learning_rate: 2.0e-5
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.01
bf16: true
flash_attn: fa2
ddp_timeout: 180000000
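As a rough, back-of-the-envelope sanity check (the byte counts below are assumptions, based on ZeRO offload keeping an fp32 master copy plus Adam states and fp32 gradients in host memory), the ds_z3_offload_config.json setting alone implies on the order of 100+ GiB of host RAM for an 8B model, before any checkpoint-time buffers:

```python
# Rough host-RAM estimate for ZeRO-3 with CPU offload (assumption: the fp32
# master weights, Adam momentum, Adam variance and fp32 gradients all live in
# host memory, i.e. about 16 bytes per parameter in total across all ranks).
params = 8e9                      # Qwen3-8B-Base, ~8B parameters
bytes_per_param = 4 + 4 + 4 + 4   # fp32 copy + Adam m + Adam v + fp32 grad
print(f"~{params * bytes_per_param / 2**30:.0f} GiB of host RAM")  # ≈ 119 GiB
```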
Reproduction
eckpoint-200/global_step200/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2025-05-14 14:25:51,391] [INFO] [engine.py:3536:_save_zero_checkpoint] zero checkpoint saved /home/master/work/LLM-Train/LLaMA-Factory-Y/saves/Qwen3/full/v15_align_0228_qwen3_8B_2e5_qtemp/checkpoint-200/global_step200/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-05-14 14:25:52,929] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step200 is ready now!
4%|▍ | 209/4694 [2:21:45<55:42:59, 44.72s/it]W0514 14:33:19.493000 261 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 326 closing signal SIGTERM
W0514 14:33:19.524000 261 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 327 closing signal SIGTERM
W0514 14:33:19.527000 261 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 329 closing signal SIGTERM
W0514 14:33:19.528000 261 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 330 closing signal SIGTERM
W0514 14:33:19.531000 261 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 331 closing signal SIGTERM
W0514 14:33:19.533000 261 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 332 closing signal SIGTERM
W0514 14:33:19.535000 261 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 333 closing signal SIGTERM
E0514 14:33:36.041000 261 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -7) local_rank: 2 (pid: 328) of binary: /home/master/.python_libs/conda_env/bs0515/bin/python3.12
Traceback (most recent call last):
File "/home/master/.python_libs/conda_env/bs0515/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/master/.python_libs/conda_env/bs0515/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/master/.python_libs/conda_env/bs0515/lib/python3.12/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/home/master/.python_libs/conda_env/bs0515/lib/python3.12/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/home/master/.python_libs/conda_env/bs0515/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/master/.python_libs/conda_env/bs0515/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/master/work/LLM-Train/LLaMA-Factory-0515/src/llamafactory/launcher.py FAILED
Others
No response
failed (exitcode: -7) — check whether you are running out of CPU memory.
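One quick way to check that hypothesis (a sketch only; it assumes a Linux host where the kernel log is readable, which may require root inside a container) is to scan dmesg for out-of-memory kills around the crash time. Note that exit code -7 means the worker died on signal 7 (SIGBUS), which can also come from exhausted shared memory (/dev/shm) rather than ordinary RAM:

```python
# Sketch: look for kernel out-of-memory kills near the crash (assumes Linux
# and permission to read the kernel ring buffer).
import subprocess

log = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout
for line in log.splitlines():
    lowered = line.lower()
    if "out of memory" in lowered or "oom" in lowered or "killed process" in lowered:
        print(line)
```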
I ran into this problem too. I suspect there is a CPU memory leak during saving: each save increases memory usage. I don't know yet how to track it down.
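One way to narrow it down (a minimal sketch, not LLaMA-Factory code; it assumes you can register an extra TrainerCallback with the underlying transformers Trainer and that psutil is installed) is to log each rank's resident memory at every logging step and right after every save, and see whether RSS jumps by a roughly constant amount per checkpoint:

```python
import os
import psutil
from transformers import TrainerCallback


class HostMemoryCallback(TrainerCallback):
    """Logs the process RSS so per-save memory growth becomes visible."""

    def _rss_gib(self) -> float:
        return psutil.Process(os.getpid()).memory_info().rss / 2**30

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % args.logging_steps == 0:
            print(f"[step {state.global_step}] rank RSS = {self._rss_gib():.1f} GiB")

    def on_save(self, args, state, control, **kwargs):
        # Fired right after a checkpoint has been written.
        print(f"[save @ step {state.global_step}] rank RSS = {self._rss_gib():.1f} GiB")
```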
https://github.com/deepspeedai/DeepSpeed/issues/3582 — a DeepSpeed ZeRO-3 issue.
Same problem here, deepspeed=0.16.9. With identical parameters, a machine with plenty of RAM runs fine while a machine with less RAM crashes.
same problem...