Hi @ETOgaosion, let me provide more details about our setup:
1) **scripts**: We're running vanilla DAPO on Qwen2.5-32B with our own data using the script at https://github.com/volcengine/verl/blob/main/recipe/dapo/run_dapo_early_qwen2.5_32b.sh
2) **environment**: [Standard...
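For anyone adapting the same recipe, a minimal sketch of how pointing that script at custom parquet data might look. It assumes the script forwards extra Hydra-style overrides to the trainer (the `data.train_files` / `data.val_files` keys are verl's standard data config keys; the paths and the pass-through behaviour are assumptions, so check the script's own variables if overrides are not forwarded):

```bash
# Hypothetical invocation with custom data; paths are placeholders.
# If run_dapo_early_qwen2.5_32b.sh does not forward "$@" to its python entry
# point, edit its TRAIN_FILE/TEST_FILE-style variables inside the script instead.
export TRAIN_FILE="/data/my_dataset/train.parquet"
export TEST_FILE="/data/my_dataset/test.parquet"

bash recipe/dapo/run_dapo_early_qwen2.5_32b.sh \
    data.train_files="$TRAIN_FILE" \
    data.val_files="$TEST_FILE"
```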
Hi @ETOgaosion, please check the PR.
Hi @ETOgaosion, it's been two weeks since this PR was opened. What's the plan for merging it? It seems like it might be stuck. On my end, I can add that...
Hello, we have the same problem.

**Model:** Qwen3-30B-A3B

**Parallelism:**
- tensor_model_parallel_size: 4
- expert_model_parallel_size: 1
- expert_tensor_parallel_size: None
- pipeline_model_parallel_size: 2
- virtual_pipeline_model_parallel_size: null
- context_parallel_size: 1
- sequence_parallel: true
> Please try using a larger `expert_model_parallel_size=4` to reduce the number of experts per rank and temporarily work around it.
>
> NVIDIA is looking into this bug; if you use Megatron to pretrain a Qwen3-30B-A3B,...
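For reference, applying that workaround from the launch command might look like the sketch below. The exact key prefix (`actor_rollout_ref.actor.megatron.*`) is an assumption based on verl's Megatron trainer config, so double-check it against your config tree and keep all other existing arguments unchanged.

```bash
# Hypothetical Hydra overrides for the suggested workaround (EP=4 so each
# expert-parallel group holds fewer experts); key paths are assumptions.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
    actor_rollout_ref.actor.megatron.expert_model_parallel_size=4 \
    actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2
```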
Out of the box, **mbridge** does not support resuming training or saving the optimizer state, and dict_checkpoint doesn't work =( Maybe it makes sense to temporarily roll back dist_checkpoint or bring...
Hi @Yangruipis, could you please share a working config for 671B?
Sorry for the dumb question, but how do you deliver the required checkpoint slice to the host? Downloading the entire checkpoint would mean terabytes. For FSDP, we filter by rank....
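In case it helps, here is a rough sketch of that rank-filtering approach for FSDP-style shards. The shard naming pattern, the S3 path, and the use of the AWS CLI are all assumptions, so adapt them to your actual checkpoint layout and storage backend.

```bash
# Hypothetical: pull only the shards needed by the ranks hosted on this node,
# instead of downloading the whole (multi-TB) checkpoint.
CKPT_URI="s3://my-bucket/verl-ckpts/global_step_100/actor"   # placeholder path
LOCAL_DIR="/data/ckpt/global_step_100/actor"
NODE_RANK=${NODE_RANK:-0}
GPUS_PER_NODE=${GPUS_PER_NODE:-8}

mkdir -p "$LOCAL_DIR"
for local_gpu in $(seq 0 $((GPUS_PER_NODE - 1))); do
    rank=$((NODE_RANK * GPUS_PER_NODE + local_gpu))
    # Assumes shard files embed the rank in their name (e.g. *_rank_${rank}.pt);
    # adjust the --include pattern to your real naming scheme.
    aws s3 cp "$CKPT_URI/" "$LOCAL_DIR/" --recursive \
        --exclude "*" --include "*_rank_${rank}.pt"
done
```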