Megatron-LM
[QUESTION] `--adam-beta2` in Mixtral 8x7B pretraining script
Your question
The original implementation of Mixtral 8x7B sets `--adam-beta2` to 0.999 by default during pretraining. However, empirical observation reveals an apparent trade-off: β₂ = 0.999 consistently induces training instability (recurring loss spikes across runs), whereas β₂ = 0.95 converges stably without any spikes.
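For context on why β₂ matters here: it is the decay rate of Adam's exponential moving average of squared gradients, so a larger β₂ makes the second-moment estimate adapt more slowly when the gradient scale suddenly changes. Below is a minimal, self-contained sketch in plain Python (not Megatron-LM code; the gradient sequence is made up purely for illustration) comparing the two settings:

```python
# Minimal sketch (not Megatron-LM code) of Adam's second-moment update
#   v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2
# with the usual bias correction v_hat_t = v_t / (1 - beta2^t).
def second_moment(beta2, grads):
    v, v_hat = 0.0, []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1.0 - beta2) * g * g
        v_hat.append(v / (1.0 - beta2 ** t))  # bias-corrected estimate
    return v_hat

# Gradients stay small for 200 steps, then a single large gradient arrives.
grads = [0.01] * 200 + [1.0]

for beta2 in (0.999, 0.95):
    v_last = second_moment(beta2, grads)[-1]
    # Effective Adam step for the large gradient (eps omitted): |g| / sqrt(v_hat).
    # A larger beta2 keeps v_hat anchored to the old, small gradient scale,
    # so the normalized step is larger and more spike-prone.
    print(f"beta2={beta2}: v_hat={v_last:.5f}, |g|/sqrt(v_hat)={1.0 / v_last ** 0.5:.1f}")
```

In this toy setting the normalized step for the outlier gradient comes out roughly three times larger with β₂ = 0.999 than with β₂ = 0.95, which matches the intuition that the higher default reacts more slowly to shifts in gradient scale.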
What empirical or theoretical justification supports keeping β₂ = 0.999 despite this observed instability? And does the default value potentially yield better final model quality at the cost of temporary instability?