Hi @ETOgaosion, let me provide more details about our setup:
1) **scripts**: We're running vanilla DAPO on Qwen2.5-32B with our own data using the script at https://github.com/volcengine/verl/blob/main/recipe/dapo/run_dapo_early_qwen2.5_32b.sh
2) **environment**: [Standard...
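For anyone adapting the same recipe, a minimal sketch of how pointing that script at custom parquet data might look. It assumes the script forwards extra Hydra-style overrides to the trainer (the `data.train_files` / `data.val_files` keys are verl's standard data config keys; the paths and the pass-through behaviour are assumptions, so check the script's own variables if overrides are not forwarded):

```bash
# Hypothetical invocation with custom data; paths are placeholders.
# If run_dapo_early_qwen2.5_32b.sh does not forward "$@" to its python entry
# point, edit its TRAIN_FILE/TEST_FILE-style variables inside the script instead.
export TRAIN_FILE="/data/my_dataset/train.parquet"
export TEST_FILE="/data/my_dataset/test.parquet"

bash recipe/dapo/run_dapo_early_qwen2.5_32b.sh \
    data.train_files="$TRAIN_FILE" \
    data.val_files="$TEST_FILE"
```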
Hi @ETOgaosion, please check the PR.
Hi @ETOgaosion, it's been two weeks since this PR was opened. What's the plan for merging it? It seems like it might be stuck. On my end, I can add that...
Hello, we have the same problem.

**Model:** Qwen3-30B-A3B

**Parallelism:**
- tensor_model_parallel_size: 4
- expert_model_parallel_size: 1
- expert_tensor_parallel_size: None
- pipeline_model_parallel_size: 2
- virtual_pipeline_model_parallel_size: null
- context_parallel_size: 1
- sequence_parallel: true
> Please try using a larger `expert_model_parallel_size=4` to reduce the number of experts per rank and temporarily work around it.
>
> NVIDIA is looking into this bug; if you use Megatron to pretrain a Qwen3-30B-A3B,...
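For reference, applying that workaround from the launch command might look like the sketch below. The exact key prefix (`actor_rollout_ref.actor.megatron.*`) is an assumption based on verl's Megatron trainer config, so double-check it against your config tree and keep all other existing arguments unchanged.

```bash
# Hypothetical Hydra overrides for the suggested workaround (EP=4 so each
# expert-parallel group holds fewer experts); key paths are assumptions.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
    actor_rollout_ref.actor.megatron.expert_model_parallel_size=4 \
    actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2
```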
Out of the box, **mbridge** does not support resuming training or saving the optimizer state, and dict_checkpoint doesn't work =( Maybe it makes sense to temporarily roll back dist_checkpoint or bring...
Hi @Yangruipis, could you please share a working config for 671B?
Sorry for the dumb question, but how do you deliver the required checkpoint slice to the host? Downloading the entire checkpoint would mean terabytes. For FSDP, we filter by rank....
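In case it helps, here is a rough sketch of that rank-filtering approach for FSDP-style shards. The shard naming pattern, the S3 path, and the use of the AWS CLI are all assumptions, so adapt them to your actual checkpoint layout and storage backend.

```bash
# Hypothetical: pull only the shards needed by the ranks hosted on this node,
# instead of downloading the whole (multi-TB) checkpoint.
CKPT_URI="s3://my-bucket/verl-ckpts/global_step_100/actor"   # placeholder path
LOCAL_DIR="/data/ckpt/global_step_100/actor"
NODE_RANK=${NODE_RANK:-0}
GPUS_PER_NODE=${GPUS_PER_NODE:-8}

mkdir -p "$LOCAL_DIR"
for local_gpu in $(seq 0 $((GPUS_PER_NODE - 1))); do
    rank=$((NODE_RANK * GPUS_PER_NODE + local_gpu))
    # Assumes shard files embed the rank in their name (e.g. *_rank_${rank}.pt);
    # adjust the --include pattern to your real naming scheme.
    aws s3 cp "$CKPT_URI/" "$LOCAL_DIR/" --recursive \
        --exclude "*" --include "*_rank_${rank}.pt"
done
```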