
Sync 4 layer norms - bf16, fp32, optimizer states on restart

tjruwase opened this issue 3 years ago · 0 comments

This PR uses https://github.com/microsoft/DeepSpeed/pull/1801 @ d911e67 to sync layer norms:

  1. for bf16 weights
  2. for fp32 weights in the bf16 optimizer
  3. for the 2 optimizer states

all_reduce with ReduceOp.AVG is used in all 3 cases.

This automatically works for all layers and all 4 types of layer norms - both weights and biases.
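The syncing described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: it matches layer norms by module type, and it computes the average as SUM divided by world size so it also runs on backends (e.g. gloo) that lack `ReduceOp.AVG`; the function name `sync_layer_norms` is made up for this sketch.

```python
import torch
import torch.distributed as dist


def sync_layer_norms(model: torch.nn.Module, group=None) -> None:
    """Average layer-norm weights and biases across all ranks in `group`.

    Hypothetical sketch: walks the module tree, finds every LayerNorm,
    and all-reduces its parameters so every rank ends up with the mean.
    """
    world_size = dist.get_world_size(group)
    for module in model.modules():
        if isinstance(module, torch.nn.LayerNorm):
            # covers both the weight and the bias of each layer norm
            for param in module.parameters():
                dist.all_reduce(param.data, op=dist.ReduceOp.SUM, group=group)
                param.data.div_(world_size)
```

With `world_size` ranks whose layer-norm parameters have drifted apart, calling this after a restart leaves every rank holding the element-wise mean of the drifted copies.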


This has been successfully applied to the live model and led to the layers getting back in sync - but after some iterations some layers drifted out of sync again, so there is some other bug still to figure out.

tjruwase · Mar 28 '22 19:03