Sync 4 layer norms - bf16, fp32, optimizer states on restart
This PR uses https://github.com/microsoft/DeepSpeed/pull/1801 @ d911e67 to sync layer norms:
- for bf16 weights
- for fp32 weights in bf16 optimizer
- for the 2 optimizer states
all_reduce with ReduceOp.AVG is used in all 3 cases.
This automatically works for all layers and all 4 types of layer norms, covering both their weights and biases.
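
For illustration, here is a minimal sketch of what such a sync can look like in plain PyTorch. This is not the actual code from the PR: `get_fp32_param` is a hypothetical hook standing in for however the bf16 optimizer exposes its fp32 master copies, the name-matching rule is an assumption about the model's parameter names, and the `exp_avg` / `exp_avg_sq` keys assume an Adam-style optimizer. It also assumes PyTorch >= 1.11 for `ReduceOp.AVG`.

```python
import torch
import torch.distributed as dist


def _avg_across_replicas(tensor: torch.Tensor, group=None) -> None:
    # Every data-parallel rank contributes its copy; the result is the mean.
    dist.all_reduce(tensor, op=dist.ReduceOp.AVG, group=group)


def sync_layer_norms(model, optimizer, get_fp32_param, group=None) -> None:
    for name, param in model.named_parameters():
        # Catch all layer-norm weights and biases by name (hypothetical
        # matching rule covering "layernorm" and "layer_norm" spellings).
        if "layernorm" not in name.lower().replace("_", ""):
            continue

        # 1. bf16 weights
        _avg_across_replicas(param.data, group)

        # 2. fp32 master weights held by the bf16 optimizer
        #    (get_fp32_param is an assumed accessor, not a DeepSpeed API)
        fp32 = get_fp32_param(param)
        _avg_across_replicas(fp32.data, group)

        # 3. the 2 optimizer states (Adam-style; keyed on the fp32 param here)
        state = optimizer.state[fp32]
        _avg_across_replicas(state["exp_avg"], group)
        _avg_across_replicas(state["exp_avg_sq"], group)
```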
This has been successfully applied to the live model and led to the layers getting back in sync. However, after some iterations some layers drifted out of sync again, so there is another bug still to figure out.