SlowMo (BMUF) support for PyTorch distributed training
This targets the parameter averaging mode of distributed training. SlowMo adds a momentum term to the outer-loop update, i.e. the update applied after each parameter averaging step.
- Wang et al., “SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum”, ICLR 2020. arXiv, OpenReview.
Original fairscale code. Code also in Fairseq.
Conceptually, the method is the same as BMUF. Only some of the experiments in the SlowMo paper go a bit beyond that.
- Chen and Huo, “Scalable Training of Deep Learning Machines by Incremental Block Training with Intra-Block Parallel Optimization and Blockwise Model-Update Filtering.” (BMUF), ICASSP 2016
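
For reference, the outer-loop update could be sketched roughly as below. This is a minimal, framework-free illustration of the slow momentum rule as given in the SlowMo paper, not the fairscale, Fairseq, or RETURNN implementation. The function name `slowmo_outer_step` and the argument names `base_lr`, `slow_lr`, and `slow_momentum` are hypothetical, and parameters are plain lists of floats rather than tensors.

```python
def slowmo_outer_step(prev_params, averaged_params, slow_buf, *,
                      base_lr, slow_lr=1.0, slow_momentum=0.5):
    """One outer-loop update, applied after parameter averaging (sketch).

    prev_params:     parameters at the start of the inner loop, x_{t,0}
    averaged_params: worker-averaged parameters after the inner loop, x_{t,tau}
    slow_buf:        slow momentum buffer u_t (all zeros before the first step)
    Returns (new_params, new_slow_buf).
    All names here are hypothetical and chosen for illustration only.
    """
    new_params, new_buf = [], []
    for x0, x_avg, u in zip(prev_params, averaged_params, slow_buf):
        # u_{t+1} = beta * u_t + (x_{t,0} - x_{t,tau}) / gamma
        u_next = slow_momentum * u + (x0 - x_avg) / base_lr
        # x_{t+1,0} = x_{t,0} - alpha * gamma * u_{t+1}
        new_params.append(x0 - slow_lr * base_lr * u_next)
        new_buf.append(u_next)
    return new_params, new_buf


# Usage: two outer steps on a single scalar parameter.
params, buf = [1.0], [0.0]
params, buf = slowmo_outer_step(params, [0.5], buf, base_lr=0.1)
params, buf = slowmo_outer_step(params, [0.25], buf, base_lr=0.1)
```

With `slow_momentum=0` and `slow_lr=1` this reduces to plain parameter averaging. With a constant `base_lr`, substituting Δ = −α·γ·u gives Δ_{t+1} = β·Δ_t + α·(x_avg − x_0), which has the same form as the BMUF block update, with β in the role of the block momentum and α in the role of the block learning rate.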