Error when enabling flash checkpoint with dlrover and pai-megatron-patch
1. Component versions: dlrover 0.4.0, pai-megatron-patch v0.10.3
2. Problem description: Pretraining the llama3.1-70B model on 2 nodes with 8*H20 GPUs each, using parallel strategy TP=8, PP=2, DP=1, and saving a checkpoint every 30 training iterations. The checkpoint (including weights and optimizer state) is written successfully, but the flash checkpoint result reports that the save did not succeed.
2.1 Modification to training.py:
#from megatron.training.checkpointing import load_checkpoint, save_checkpoint
from dlrover.trainer.torch.flash_checkpoint.megatron_dist_ckpt import save_checkpoint
from dlrover.trainer.torch.flash_checkpoint.megatron_dist_ckpt import load_checkpoint
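For context, the change is meant as a drop-in swap: only the import is redirected and the save/load call sites inside Megatron's training loop stay unchanged. Below is a minimal sketch of the modified part of training.py; the call signature and the maybe_save helper follow stock Megatron conventions and are assumptions, not verified against pai-megatron-patch v0.10.3:

# training.py -- only the import is swapped; call sites are left as in stock Megatron.
# from megatron.training.checkpointing import load_checkpoint, save_checkpoint  # original
from dlrover.trainer.torch.flash_checkpoint.megatron_dist_ckpt import (
    load_checkpoint,
    save_checkpoint,
)

def maybe_save(args, iteration, model, optimizer, opt_param_scheduler):
    """Hypothetical helper mirroring Megatron's periodic-save logic."""
    if args.save and args.save_interval and iteration % args.save_interval == 0:
        # Same positional arguments as Megatron's own save_checkpoint; the dlrover
        # replacement writes the shard to shared memory and persists it asynchronously.
        save_checkpoint(iteration, model, optimizer, opt_param_scheduler)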
2.2 Training is launched through the Megatron-LM framework with the following Megatron options:
megatron_options="
--save ${SAVED_PRETRAIN_CHECKPOINT_PATH}
--lr ${LR}
--min-lr ${MIN_LR}
--lr-decay-style cosine
--weight-decay 0.1
--adam-beta1 0.9
--adam-beta2 0.95
--clip-grad 1.0
--init-method-std 0.008
--attention-dropout 0.0
--hidden-dropout 0.0
--lr-decay-iters ${LR_DECAY_ITERS}
--lr-warmup-iters ${LR_WARMUP_ITERS}
--train-iters ${TRAIN_ITERS}
--micro-batch-size ${BATCH_SIZE}
--global-batch-size ${GLOBAL_BATCH_SIZE}
--num-layers ${NUM_LAYERS}
--hidden-size ${HIDDEN_SIZE}
--num-attention-heads ${NUM_ATTN_HEADS}
--ffn-hidden-size ${INTERMEDIATE_SIZE}
--seq-length ${SEQ_LEN}
--max-position-embeddings ${MAX_POSITION_EMBEDDINGS}
--max-padding-length ${PAD_LEN}
--log-interval 1
--log-throughput
--eval-interval 10000
--eval-iters 10
--save-interval ${SAVE_INTERVAL}
--tensorboard-queue-size 1
--tensorboard-dir ${TENSORBOARD_DIR}
--log-timers-to-tensorboard
--log-batch-size-to-tensorboard
--log-validation-ppl-to-tensorboard
--tensor-model-parallel-size ${TP}
--pipeline-model-parallel-size ${PP}
--context-parallel-size ${CP}
--num-workers 8
--extra-vocab-size ${EXTRA_VOCAB_SIZE}
--patch-tokenizer-type LLama3Tokenizer
--swiglu
--normalization RMSNorm
--norm-epsilon 1e-05
--use-rotary-position-embeddings
--position-embedding-type rope
--untie-embeddings-and-output-weights
--disable-bias-linear
--rotary-base 500000
        --use-dist-ckpt
        --dist-ckpt-format torch_dist
"
2.3 Training script:
sh run_mcore_llama3_1.sh dsw 70B 1 32 1e-5 1e-6 128 128 bf16 8 2 1 true true true false false false 30 /mnt/llama3-datasets/wudao_llama3bpe_content_document /mnt/llama3-datasets/wudao_llama3bpe_content_document /mnt/llama3-ckpts/Meta-Llama-3.1-70B/mcore-tp8-pp2 100000000 100000 /mnt/output_mcore_llama3_1
2.4 Error logs from the training containers.
Instance 1:
[2025-03-20 11:17:01,095] [INFO] [ckpt_saver.py:718:_save_shard] Finish saving the checkpoint shard 7 of rank 7.
[2025-03-20 11:17:01,546] [INFO] [master_client.py:528:build_master_client] set master_client timeout to 180
[2025-03-20 11:17:01,546] [INFO] [master_client.py:531:build_master_client] Build master client with addr .
[2025-03-20 11:17:01,551] [INFO] [ckpt_saver.py:718:_save_shard] Finish saving the checkpoint shard 4 of rank 4.
[2025-03-20 11:17:02,248] [INFO] [master_client.py:528:build_master_client] set master_client timeout to 180
[2025-03-20 11:17:02,249] [INFO] [master_client.py:531:build_master_client] Build master client with addr .
[2025-03-20 11:17:02,256] [INFO] [ckpt_saver.py:718:_save_shard] Finish saving the checkpoint shard 5 of rank 5.
[2025-03-20 11:17:02,607] [INFO] [master_client.py:528:build_master_client] set master_client timeout to 180
[2025-03-20 11:17:02,608] [INFO] [master_client.py:531:build_master_client] Build master client with addr .
[2025-03-20 11:17:02,626] [INFO] [ckpt_saver.py:718:_save_shard] Finish saving the checkpoint shard 6 of rank 6.
[2025-03-20 11:17:03,295] [INFO] [master_client.py:528:build_master_client] set master_client timeout to 180
[2025-03-20 11:17:03,295] [INFO] [master_client.py:531:build_master_client] Build master client with addr .
[2025-03-20 11:17:03,301] [INFO] [ckpt_saver.py:718:_save_shard] Finish saving the checkpoint shard 2 of rank 2.
[2025-03-20 11:17:03,464] [INFO] [master_client.py:528:build_master_client] set master_client timeout to 180
[2025-03-20 11:17:03,464] [INFO] [master_client.py:531:build_master_client] Build master client with addr .
[2025-03-20 11:17:03,490] [INFO] [ckpt_saver.py:718:_save_shard] Finish saving the checkpoint shard 3 of rank 3.
[2025-03-20 11:17:03,531] [INFO] [master_client.py:528:build_master_client] set master_client timeout to 180
[2025-03-20 11:17:03,531] [INFO] [master_client.py:531:build_master_client] Build master client with addr .
[2025-03-20 11:17:03,537] [INFO] [ckpt_saver.py:718:_save_shard] Finish saving the checkpoint shard 1 of rank 1.
[2025-03-20 11:17:03,938] [INFO] [master_client.py:528:build_master_client] set master_client timeout to 180
[2025-03-20 11:17:03,939] [INFO] [master_client.py:531:build_master_client] Build master client with addr .
[2025-03-20 11:17:03,944] [INFO] [ckpt_saver.py:718:_save_shard] Finish saving the checkpoint shard 0 of rank 0.
[2025-03-20 11:17:03,952] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 11 != 16.
[2025-03-20 11:17:08,956] [ERROR] [ckpt_saver.py:615:_sync_shm_to_storage] Got unexpected exception during checkpointing: CheckpointEvent(type=<CheckpointEventType.SAVE: 1>, step=60, global_shard_num=0), error: [Errno 2] No such file or directory: '/mnt/output_mcore_llama3_1/checkpoint/pretrain-mcore-llama3-1-70B-lr-1e-5-minlr-1e-6-bs-1-gbs-32-seqlen-128-pr-bf16-tp-8-pp-2-cp-1-ac-false-do-true-sp-true-ti-24414-wi-24/._dlrover_ckpt_stage/60.done'.
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 611, in _sync_shm_to_storage
    self.save_step_checkpoint(event.step)
  File "/usr/local/lib/python3.12/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 1017, in save_step_checkpoint
    self.commit_checkpoint(
  File "/usr/local/lib/python3.12/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 1044, in commit_checkpoint
    done_files = self.storage.listdir(step_done_dir)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/dlrover/python/common/storage.py", line 185, in listdir
    return os.listdir(path)
           ^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/output_mcore_llama3_1/checkpoint/pretrain-mcore-llama3-1-70B-lr-1e-5-minlr-1e-6-bs-1-gbs-32-seqlen-128-pr-bf16-tp-8-pp-2-cp-1-ac-false-do-true-sp-true-ti-24414-wi-24/._dlrover_ckpt_stage/60.done'
[2025-03-20 11:17:08,961] [INFO] [master_client.py:528:build_master_client] set master_client timeout to 180
[2025-03-20 11:17:08,961] [INFO] [master_client.py:531:build_master_client] Build master client with addr .
[2025-03-20 11:17:08,961] [WARNING] [ckpt_saver.py:641:_report_failure_to_master] Failed to report failure to master in ckpt saver: 'NoneType' object has no attribute 'report_failures'.
Instance 2:
[2025-03-20 11:11:36,280] [INFO] [ckpt_saver.py:718:_save_shard] Finish saving the checkpoint shard 5 of rank 13.
[2025-03-20 11:11:36,408] [INFO] [master_client.py:528:build_master_client] set master_client timeout to 180
[2025-03-20 11:11:36,408] [INFO] [master_client.py:531:build_master_client] Build master client with addr .
[2025-03-20 11:11:36,426] [INFO] [ckpt_saver.py:718:_save_shard] Finish saving the checkpoint shard 7 of rank 15.
[2025-03-20 11:11:37,022] [INFO] [master_client.py:528:build_master_client] set master_client timeout to 180
[2025-03-20 11:11:37,022] [INFO] [master_client.py:531:build_master_client] Build master client with addr .
[2025-03-20 11:11:37,028] [INFO] [ckpt_saver.py:718:_save_shard] Finish saving the checkpoint shard 2 of rank 10.
[2025-03-20 11:11:37,045] [INFO] [master_client.py:528:build_master_client] set master_client timeout to 180
[2025-03-20 11:11:37,045] [INFO] [master_client.py:531:build_master_client] Build master client with addr .
[2025-03-20 11:11:37,049] [INFO] [ckpt_saver.py:718:_save_shard] Finish saving the checkpoint shard 4 of rank 12.
[2025-03-20 11:11:37,206] [INFO] [master_client.py:528:build_master_client] set master_client timeout to 180
[2025-03-20 11:11:37,207] [INFO] [master_client.py:531:build_master_client] Build master client with addr .
[2025-03-20 11:11:37,212] [INFO] [ckpt_saver.py:718:_save_shard] Finish saving the checkpoint shard 6 of rank 14.
[2025-03-20 11:11:37,301] [INFO] [master_client.py:528:build_master_client] set master_client timeout to 180
[2025-03-20 11:11:37,301] [INFO] [master_client.py:531:build_master_client] Build master client with addr .
[2025-03-20 11:11:37,307] [INFO] [ckpt_saver.py:718:_save_shard] Finish saving the checkpoint shard 0 of rank 8.
[2025-03-20 11:11:37,536] [INFO] [master_client.py:528:build_master_client] set master_client timeout to 180
[2025-03-20 11:11:37,536] [INFO] [master_client.py:531:build_master_client] Build master client with addr .
[2025-03-20 11:11:37,544] [INFO] [ckpt_saver.py:718:_save_shard] Finish saving the checkpoint shard 3 of rank 11.
[2025-03-20 11:11:38,068] [INFO] [master_client.py:528:build_master_client] set master_client timeout to 180
[2025-03-20 11:11:38,068] [INFO] [master_client.py:531:build_master_client] Build master client with addr .
[2025-03-20 11:11:38,074] [INFO] [ckpt_saver.py:718:_save_shard] Finish saving the checkpoint shard 1 of rank 9.
[2025-03-20 11:11:38,081] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 14 != 16.
[2025-03-20 11:11:41] iteration 36/ 24414 | consumed samples: 1152 | elapsed time per iteration (ms): 10442.4 | throughput per GPU (TFLOP/s/GPU): 10.2 | learning rate: 9.999995E-06 | global batch size: 32 | lm loss: 2.778350E+00 | loss scale: 1.0 | grad norm: 1336.631 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-03-20 11:11:43,086] [ERROR] [ckpt_saver.py:615:_sync_shm_to_storage] Got unexpected exception during checkpointing: CheckpointEvent(type=<CheckpointEventType.SAVE: 1>, step=30, global_shard_num=0), error: [Errno 2] No such file or directory: '/mnt/output_mcore_llama3_1/checkpoint/pretrain-mcore-llama3-1-70B-lr-1e-5-minlr-1e-6-bs-1-gbs-32-seqlen-128-pr-bf16-tp-8-pp-2-cp-1-ac-false-do-true-sp-true-ti-24414-wi-24/._dlrover_ckpt_stage/30.done'.
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 611, in _sync_shm_to_storage
    self.save_step_checkpoint(event.step)
  File "/usr/local/lib/python3.12/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 1017, in save_step_checkpoint
    self.commit_checkpoint(
  File "/usr/local/lib/python3.12/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 1044, in commit_checkpoint
    done_files = self.storage.listdir(step_done_dir)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/dlrover/python/common/storage.py", line 185, in listdir
    return os.listdir(path)
           ^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/output_mcore_llama3_1/checkpoint/pretrain-mcore-llama3-1-70B-lr-1e-5-minlr-1e-6-bs-1-gbs-32-seqlen-128-pr-bf16-tp-8-pp-2-cp-1-ac-false-do-true-sp-true-ti-24414-wi-24/._dlrover_ckpt_stage/30.done'
[2025-03-20 11:11:43,097] [INFO] [master_client.py:528:build_master_client] set master_client timeout to 180
[2025-03-20 11:11:43,097] [INFO] [master_client.py:531:build_master_client] Build master client with addr .
[2025-03-20 11:11:43,097] [WARNING] [ckpt_saver.py:641:_report_failure_to_master] Failed to report failure to master in ckpt saver: 'NoneType' object has no attribute 'report_failures'.
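From the traceback and the "number of ready shards is 11 != 16" / "14 != 16" messages, the commit step lists <save_dir>/._dlrover_ckpt_stage/ to count the per-rank <step>.done files and only commits once all 16 shards are present; here the listing already fails because the staging directory is not visible to the node doing the commit. One way to narrow this down is to check from inside each training container that the --save path is genuinely shared, writable storage. The sketch below is not part of dlrover; the path is copied from the error log and the probe itself is an assumed diagnostic, not a fix:

# check_stage_dir.py -- hypothetical helper; run inside each training container to
# verify the checkpoint directory is shared and writable from every node.
import os
import socket

# Path taken verbatim from the error log.
SAVE_DIR = (
    "/mnt/output_mcore_llama3_1/checkpoint/"
    "pretrain-mcore-llama3-1-70B-lr-1e-5-minlr-1e-6-bs-1-gbs-32-seqlen-128-"
    "pr-bf16-tp-8-pp-2-cp-1-ac-false-do-true-sp-true-ti-24414-wi-24"
)
STAGE_DIR = os.path.join(SAVE_DIR, "._dlrover_ckpt_stage")

host = socket.gethostname()
print(f"[{host}] save dir exists: {os.path.isdir(SAVE_DIR)}")
print(f"[{host}] stage dir exists: {os.path.isdir(STAGE_DIR)}")
if os.path.isdir(STAGE_DIR):
    print(f"[{host}] stage dir contents: {sorted(os.listdir(STAGE_DIR))}")

# Write a per-host marker; if the other node cannot see it, /mnt is not shared storage.
marker = os.path.join(SAVE_DIR, f".shared_fs_probe_{host}")
with open(marker, "w") as f:
    f.write(host + "\n")
print(f"[{host}] wrote probe file: {marker}")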
We now recommend using the checkpoint implementation from the latest version of Megatron directly: dlrover's flash checkpoint integration with Megatron is based on Megatron 0.6, which is outdated, while the checkpoint implementation in the latest Megatron release is more mature.
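In practice that means reverting the import in training.py to stock Megatron and relying on Megatron's own distributed checkpointing (the --use-dist-ckpt and --dist-ckpt-format torch_dist flags already present in the launch options), for example:

# training.py -- restore the stock Megatron checkpoint implementation.
from megatron.training.checkpointing import load_checkpoint, save_checkpoint

# The dlrover flash-checkpoint imports are removed:
# from dlrover.trainer.torch.flash_checkpoint.megatron_dist_ckpt import save_checkpoint
# from dlrover.trainer.torch.flash_checkpoint.megatron_dist_ckpt import load_checkpoint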