[Bug] dist_checkpointing stuck on communication with MoE models in distributed environment
Qwen3 30B MoE models get stuck on all_reduce communication with dist_checkpointing. When running with 32 GPUs, saving a checkpoint takes 22 minutes, and rank 0 takes even longer (36 minutes).
If we do not wrap with FullyParallelSaveStrategyWrapper, it still gets stuck, on an all_gather operation instead.
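For context, the save path in the trace below goes through Megatron-Core's FullyParallelSaveStrategyWrapper. A minimal sketch of that wrapping, modeled on megatron.training's checkpointing code (module paths per the Megatron-Core version in the trace; verl's wrapper may differ in details):

# Sketch of the Megatron-Core dist_checkpointing save path, modeled on
# megatron.training's checkpointing; verl's wrapper may differ in details.
from megatron.core import dist_checkpointing, parallel_state
from megatron.core.dist_checkpointing.serialization import get_default_save_sharded_strategy
from megatron.core.dist_checkpointing.strategies.fully_parallel import (
    FullyParallelSaveStrategyWrapper,
)


def save_sharded(sharded_state_dict, ckpt_dir, fully_parallel=True):
    save_strategy = get_default_save_sharded_strategy("torch_dist")
    if fully_parallel:
        # Spreads the actual writes across data-parallel replicas; this is
        # where the extra all_gather/all_reduce coordination comes from.
        save_strategy = FullyParallelSaveStrategyWrapper(
            save_strategy,
            parallel_state.get_data_parallel_group(with_context_parallel=True),
        )
    dist_checkpointing.save(sharded_state_dict, ckpt_dir, save_strategy)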
[rank5]:[E627 12:26:40.073493187 ProcessGroupNCCL.cpp:629] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=29, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
[rank5]:[E627 12:26:40.073594829 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 5] failure detected by watchdog at work sequence id: 29 PG status: last enqueued work: 29, last completed work: 28
[rank5]:[E627 12:26:40.074084476 ProcessGroupNCCL.cpp:664] Stack trace of the failed collective:
#0 all_reduce from /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2806
#1 wrapper from /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:81
#2 sync_all_async_calls from /usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/strategies/async_utils.py:149
#3 is_current_async_call_done from /usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/strategies/async_utils.py:228
#4 maybe_finalize_async_calls from /usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/strategies/async_utils.py:537
#5 save from /usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/strategies/base.py:228
#6 save from /usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/strategies/fully_parallel.py:95
#7 save from /usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/serialization.py:396
#8 save_dist_checkpointing from /opt/tiger/ray/session_2025-06-27_11-23-52_209009_2768/runtime_resources/working_dir_files/_ray_pkg_8d362968e136b949/verl/utils/megatron/dist_checkpointing.py:27
#9 save_checkpoint from /opt/tiger/ray/session_2025-06-27_11-23-52_209009_2768/runtime_resources/working_dir_files/_ray_pkg_8d362968e136b949/verl/utils/checkpoint/megatron_checkpoint_manager.py:356
#10 save_checkpoint from /opt/tiger/ray/session_2025-06-27_11-23-52_209009_2768/runtime_resources/working_dir_files/_ray_pkg_8d362968e136b949/verl/workers/megatron_workers.py:572
#11 inner from /opt/tiger/ray/session_2025-06-27_11-23-52_209009_2768/runtime_resources/working_dir_files/_ray_pkg_8d362968e136b949/verl/single_controller/base/decorator.py:540
#12 func from /opt/tiger/ray/session_2025-06-27_11-23-52_209009_2768/runtime_resources/working_dir_files/_ray_pkg_8d362968e136b949/verl/single_controller/ray/base.py:663
#13 _resume_span from /usr/local/lib/python3.10/dist-packages/ray/util/tracing/tracing_helper.py:467
#14 actor_method_executor from /usr/local/lib/python3.10/dist-packages/ray/_private/function_manager.py:722
#15 main_loop from /usr/local/lib/python3.10/dist-packages/ray/_private/worker.py:892
#16 <module> from /usr/local/lib/python3.10/dist-packages/ray/_private/workers/default_worker.py:327
The Qwen3-32B dense model works fine; saving takes only seconds.
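Not a fix, but since the watchdog above fires at the default 10-minute collective timeout while the save itself takes 20+ minutes, one blunt mitigation is to raise the process-group timeout where init_process_group is called. A generic PyTorch sketch (this is not an existing verl config option):

# Generic PyTorch sketch: raise the collective timeout so a 20+ minute save
# does not trip the 600 s NCCL watchdog shown in the log above. This only
# masks the slowness of dist_checkpointing, it does not fix it.
import datetime

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=60),
)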
@Yangruipis Have you tested our latest dist_checkpointing implementation in a distributed environment? Do you have any idea what the problem with the MoE model is here? I didn't find much difference from megatron.training's implementation.
No, I have not pulled the latest code yet. What are your parallel settings? I will try to reproduce it today.
Thanks~ My script (the qwen3 moe nightly CI) is below. I didn't enable EP; maybe I should test that:
DATA_TRAIN_PATH=
DATA_VAL_PATH=
MODEL_PATH=
MNT_POINT=
python3 -m verl.trainer.main_ppo --config-path=config \
--config-name='ppo_megatron_trainer.yaml' \
algorithm.adv_estimator=grpo \
data.train_files="$DATA_TRAIN_PATH" \
data.val_files="$DATA_VAL_PATH" \
data.train_batch_size=64 \
data.max_prompt_length=1024 \
data.max_response_length=2048 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=$MODEL_PATH \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=4 \
actor_rollout_ref.actor.megatron.context_parallel_size=2 \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.actor.megatron.param_offload=True \
actor_rollout_ref.actor.megatron.grad_offload=True \
actor_rollout_ref.actor.megatron.optimizer_offload=True \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.max_num_batched_tokens=10240 \
actor_rollout_ref.rollout.n=8 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=4 \
actor_rollout_ref.ref.megatron.context_parallel_size=2 \
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.ref.megatron.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_grpo_example_gsm8k_math_nightly_ci' \
trainer.experiment_name='qwen3_30b_megatron' \
trainer.default_local_dir="$MNT_POINT" \
trainer.resume_mode='disable' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=4 \
trainer.save_freq=20 \
trainer.test_freq=5 \
trainer.total_epochs=15
I've tested expert parallelism; it seems that when the number of experts per rank decreases, this issue disappears.
I cannot reproduce it with Moonlight-16B (DeepSeek-V3 arch) on 4 nodes (pp4 tp8 ep8 etp1); everything looks fine to me.
I reproduced this timeout error with Qwen3-MoE 30B on 2 nodes with ep=2:
set -x
# Paths
HF_MODEL_PATH=/root/myCodeLab/host/downloads/models/Qwen3-30B-A3B
DIST_CKPT_PATH=/root/myCodeLab/host/downloads/models/Qwen3-30B-A3B_DIST
TRAIN_FILE=/root/myCodeLab/host/downloads/datasets/dapo_data/dapo-math-17k.parquet
aime24_test_path=/root/myCodeLab/host/downloads/datasets/dapo_data/aime-2024.parquet
TEST_FILE="['$aime24_test_path']"
RUNTIME_ENV=${RUNTIME_ENV:-"${HOME}/myCodeLab/host/verl/my_scripts/my_runtime_env.yaml"}
python scripts/converter_hf_to_mcore.py --hf_model_path $HF_MODEL_PATH --output_path $DIST_CKPT_PATH
# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS
export CUDA_DEVICE_MAX_CONNECTIONS=1 # For megatron communication/computation overlapping
RAY_ADDRESS='auto' ray job submit --runtime-env="${RUNTIME_ENV}" -- \
python3 -m verl.trainer.main_ppo --config-path=config \
--config-name='ppo_megatron_trainer.yaml' \
algorithm.adv_estimator=grpo \
data.train_files="${TRAIN_FILE}" \
data.val_files="${TEST_FILE}" \
data.train_batch_size=64 \
data.max_prompt_length=1024 \
data.max_response_length=2048 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=$HF_MODEL_PATH \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=32 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2 \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.actor.megatron.expert_model_parallel_size=2 \
actor_rollout_ref.actor.megatron.use_dist_checkpointing=True \
actor_rollout_ref.actor.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=2 \
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.ref.megatron.expert_model_parallel_size=4 \
actor_rollout_ref.ref.megatron.use_dist_checkpointing=True \
actor_rollout_ref.ref.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_grpo_example_gsm8k_math' \
trainer.experiment_name='qwen3_30b_moe_megatron' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=2 \
trainer.save_freq=20 \
trainer.test_freq=5 \
trainer.total_epochs=15
I met the same issue when saving the checkpoint. Have you solved it?
Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=/nfs-153/liyonghui/gsm8k/train.parquet', 'data.val_files=/nfs-153/liyonghui/gsm8k/test.parquet', 'data.train_batch_size=64', 'data.max_prompt_length=1024', 'data.max_response_length=2048', 'data.filter_overlong_prompts=True', 'data.truncation=error', 'actor_rollout_ref.model.path=/itdd-pfs/liyonghui/models/Qwen/Qwen3-30B-A3B', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.actor.ppo_mini_batch_size=64', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4', 'actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2', 'actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4', 'actor_rollout_ref.actor.megatron.expert_model_parallel_size=2', 'actor_rollout_ref.actor.megatron.use_dist_checkpointing=True', 'actor_rollout_ref.actor.megatron.dist_checkpointing_path=/itdd-pfs/liyonghui/models/Qwen/Qwen3-30B-A3B_dist_ckpt_mcore', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.001', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.actor.entropy_coeff=0', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4', 'actor_rollout_ref.rollout.tensor_model_parallel_size=4', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.6', 'actor_rollout_ref.rollout.n=5', 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4', 'actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=2', 'actor_rollout_ref.ref.megatron.tensor_model_parallel_size=4', 'actor_rollout_ref.ref.megatron.expert_model_parallel_size=2', 'actor_rollout_ref.ref.megatron.use_dist_checkpointing=True', 'actor_rollout_ref.ref.megatron.dist_checkpointing_path=/itdd-pfs/liyonghui/models/Qwen/Qwen3-30B-A3B_dist_ckpt_mcore', 'algorithm.use_kl_in_reward=False', 'trainer.critic_warmup=0', 'trainer.logger=[console,wandb]', 'trainer.project_name=RL_verl_risk_decision', 'trainer.experiment_name=qwen3_30b_moe_megatron_debug_new', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=2', 'trainer.save_freq=5', 'trainer.test_freq=5', 'trainer.total_epochs=15', 'trainer.default_local_dir=/itdd-pfs/liyonghui/save_ckpt/RL_verl/gsm8k/V1/qwen3_30b_moe_megatron_debug_new']
Traceback (most recent call last):
  File "/nfs-153/liyonghui/RL/verl_test_2025_07_09/dxm/rpa_algorithm_group/verl/verl/trainer/main_ppo.py", line 39, in main
    run_ppo(config)
  File "/nfs-153/liyonghui/RL/verl_test_2025_07_09/dxm/rpa_algorithm_group/verl/verl/trainer/main_ppo.py", line 69, in run_ppo
    ray.get(runner.run.remote(config))
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2822, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 930, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorUnavailableError): ray::TaskRunner.run() (pid=11620, ip=10.25.0.28, actor_id=fa79fd5514a67d15bf3c41ee01000000, repr=<main_ppo.TaskRunner object at 0x7f6d73319810>)
  File "/nfs-153/liyonghui/RL/verl_test_2025_07_09/dxm/rpa_algorithm_group/verl/verl/trainer/main_ppo.py", line 234, in run
    trainer.fit()
  File "/nfs-153/liyonghui/RL/verl_test_2025_07_09/dxm/rpa_algorithm_group/verl/verl/trainer/ppo/ray_trainer.py", line 1525, in fit
    self._save_checkpoint()
  File "/nfs-153/liyonghui/RL/verl_test_2025_07_09/dxm/rpa_algorithm_group/verl/verl/trainer/ppo/ray_trainer.py", line 1054, in _save_checkpoint
    self.actor_rollout_wg.save_checkpoint(
  File "/nfs-153/liyonghui/RL/verl_test_2025_07_09/dxm/rpa_algorithm_group/verl/verl/single_controller/ray/base.py", line 51, in call
    output = ray.get(output)
ray.exceptions.ActorUnavailableError: The actor 2798a12845284d28784a705301000000 is unavailable: The actor is temporarily unavailable: RpcError: RPC Error message: keepalive watchdog timeout; RPC Error details: rpc_code: 14. The task may or maynot have been executed on the actor.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(WorkerDict pid=3851, ip=10.25.0.39) [rank11]:[E710 04:53:05.583801753 ProcessGroupNCCL.cpp:629] [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=46, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600067 milliseconds before timing out. [repeated 14x across cluster]
(WorkerDict pid=3851, ip=10.25.0.39) [rank11]:[E710 04:53:05.583872815 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 11] failure detected by watchdog at work sequence id: 46 PG status: last enqueued work: 46, last completed work: 45 [repeated 14x across cluster]
(WorkerDict pid=3851, ip=10.25.0.39) [rank11]:[E710 04:53:05.583884990 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value. [repeated 14x across cluster]
(WorkerDict pid=3851, ip=10.25.0.39) [rank11]:[E710 04:53:05.583889748 ProcessGroupNCCL.cpp:681] [Rank 11] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [repeated 14x across cluster]
(WorkerDict pid=3851, ip=10.25.0.39) [rank11]:[E710 04:53:05.583893996 ProcessGroupNCCL.cpp:695] [Rank 11] To avoid data inconsistency, we are taking the entire process down. [repeated 14x across cluster]
(WorkerDict pid=3851, ip=10.25.0.39) [rank11]:[E710 04:53:05.585305874 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 11] Process group watchdog thread terminated with exception: [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=46, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600067 milliseconds before timing out. [repeated 14x across cluster]
(WorkerDict pid=3851, ip=10.25.0.39) Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first): [repeated 28x across cluster]
(WorkerDict pid=3851, ip=10.25.0.39) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f298ddfb1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) [repeated 42x across cluster]
(WorkerDict pid=3851, ip=10.25.0.39) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f2937b9ac74 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) [repeated 28x across cluster]
(WorkerDict pid=3851, ip=10.25.0.39) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f2937b9d6ed in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) [repeated 56x across cluster]
(WorkerDict pid=3851, ip=10.25.0.39) frame #4:
Same here.
Hello, we have the same problem.
Model: Qwen3-30B-A3B
Parallelism:
tensor_model_parallel_size: 4
expert_model_parallel_size: 1
expert_tensor_parallel_size: None
pipeline_model_parallel_size: 2
virtual_pipeline_model_parallel_size: null
context_parallel_size: 1
sequence_parallel: true
Please try using a larger expert_model_parallel_size (e.g. 4) to reduce the number of experts per rank as a temporary workaround.
NVIDIA is looking into this bug; if you use Megatron to pretrain Qwen3-30B-A3B, you will also hit it.
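For intuition on why a larger EP size helps, the rough calculation below works out the local expert count per EP rank; it assumes Qwen3-30B-A3B's 128 routed experts, which is the only model-specific number used.

# Rough arithmetic (not verl code): local experts held by each EP rank.
# Assumes Qwen3-30B-A3B's 128 routed experts; fewer local experts means
# fewer per-rank expert tensors for dist_checkpointing to coordinate.
NUM_ROUTED_EXPERTS = 128
for ep in (1, 2, 4, 8):
    print(f"ep={ep}: {NUM_ROUTED_EXPERTS // ep} local experts per rank")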
Hi, thanks for the advice.
But we already tried these parameters (nodes=32):
tensor_model_parallel_size: 4
expert_model_parallel_size: 4
expert_tensor_parallel_size: null
pipeline_model_parallel_size: 1
virtual_pipeline_model_parallel_size: null
context_parallel_size: 1
sequence_parallel: true
and still got stuck during saving.
Note that mbridge is also supported now; please try using it, @rj42.
Actually, there are more bugs in dist_checkpointing than we thought; it does not work for all configurations, and VPP seems unsupported. Maybe mbridge deserves more attention.
mbridge out of the box does not support resuming training or saving the optimizer state, and dist_checkpointing doesn't work =( Maybe it makes sense to temporarily roll back dist_checkpointing or bring back support for the old mechanism?
Yes, we ran into this too. For our 671B training job we only enable dist_checkpointing save for the model; the optimizer is still saved with torch.save, and training is currently stable.
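A minimal illustrative sketch of that split (model weights through Megatron-Core dist_checkpointing, optimizer state through plain per-rank torch.save); the function names and file layout here are hypothetical, not verl's actual API:

# Illustrative only: model weights via dist_checkpointing, optimizer via
# per-rank torch.save. Names and layout are hypothetical, not verl's API.
import os

import torch
import torch.distributed as dist
from megatron.core import dist_checkpointing


def save_split(model, optimizer, ckpt_dir):
    # Sharded save for the model weights only.
    model_dir = os.path.join(ckpt_dir, "model")
    os.makedirs(model_dir, exist_ok=True)
    dist_checkpointing.save(model.sharded_state_dict(), model_dir)
    # Plain per-rank dump for the optimizer state, bypassing dist_checkpointing.
    torch.save(
        optimizer.state_dict(),
        os.path.join(ckpt_dir, f"optim_rank_{dist.get_rank()}.pt"),
    )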
@Yangruipis, hi, could you please share a working config for 671b?
You would have to hack the code, I'm afraid; this is not configurable.
I just executed a load / train-one-GRPO-step / save cycle for the Qwen3 235B model using mbridge on a cluster of eight H200 nodes, each equipped with 2TB of memory. It is more memory-efficient and faster for saving checkpoints (without the optimizer state) compared to the dist_checkpointing method in my case.
Sorry for the dumb question, but how do you deliver the required checkpoint slice to the host? Downloading the entire checkpoint would mean terabytes. For FSDP, we filter by rank. But how do you pull this off in Megatron? Could you please advise?
Hi, could you share the code changes needed to make it work with 671B? Any pointers would be greatly appreciated!
Have you solved this? I ran into the same problem; it errors out when saving the checkpoint.