Blue Space
### Describe the bug The `compgen -g` command causes a repeatable autosuggestions crash, tested on multiple machines. ### To Reproduce Steps to reproduce the behavior: 1. configure oh-my-zsh 2. add zsh-autosuggestions to...
Verl's Megatron core_r0.11.0 backend successfully tested with 3D parallelism, with multiple bugs fixed
This PR combines multiple modifications. # Qwen2.5 checkpoint saver bug fix Thanks to the efforts @uygnef contributed in #368, we use the new saver for model loader and saver...
**Describe the bug** P2P communication order error and hang when PP=2 and VPP=2 with remove pad. **To Reproduce** When using `PP=2` and `VPP=2` with `config.variable_seq_lengths=True`, `config.batch_p2p_comm=True` and `config.overlap_p2p_comm=False`,...
### Checklist Before Starting - [ ] Search for similar PR(s). ### What does this PR do? Fix EP bug and try to add CI with a 15B model, finding smaller...
### Checklist Before Starting - [x] Search for similar PR(s). ### What does this PR do? Support LR scheduler in Megatron ### High-Level Design Still some differences from FSDP's...
# dist_checkpointing stuck on communication with MoE models in distributed environment Qwen 3 30B MoE models get stuck on all_reduce communication with dist_checkpoint. When running with 32 GPUs, it takes...