Megatron-LM
[QUESTION] `--adam-beta2` in Mixtral 8x7B pretraining script
Your question
The original implementation of Mixtral 8x7B sets `--adam-beta2` to 0.999 by default during pretraining. However, empirical observation reveals an apparent trade-off: β₂ = 0.999 consistently induces training instability (recurring loss spikes across runs), whereas β₂ = 0.95 converges stably without any spikes.
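For context on why β₂ matters here: it is the decay rate of Adam's exponential moving average of squared gradients, so a larger β₂ makes the second-moment estimate adapt more slowly when the gradient scale suddenly changes. Below is a minimal, self-contained sketch in plain Python (not Megatron-LM code; the gradient sequence is made up purely for illustration) comparing the two settings:

```python
# Minimal sketch (not Megatron-LM code) of Adam's second-moment update
#   v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2
# with the usual bias correction v_hat_t = v_t / (1 - beta2^t).
def second_moment(beta2, grads):
    v, v_hat = 0.0, []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1.0 - beta2) * g * g
        v_hat.append(v / (1.0 - beta2 ** t))  # bias-corrected estimate
    return v_hat

# Gradients stay small for 200 steps, then a single large gradient arrives.
grads = [0.01] * 200 + [1.0]

for beta2 in (0.999, 0.95):
    v_last = second_moment(beta2, grads)[-1]
    # Effective Adam step for the large gradient (eps omitted): |g| / sqrt(v_hat).
    # A larger beta2 keeps v_hat anchored to the old, small gradient scale,
    # so the normalized step is larger and more spike-prone.
    print(f"beta2={beta2}: v_hat={v_last:.5f}, |g|/sqrt(v_hat)={1.0 / v_last ** 0.5:.1f}")
```

In this toy setting the normalized step for the outlier gradient comes out roughly three times larger with β₂ = 0.999 than with β₂ = 0.95, which matches the intuition that the higher default reacts more slowly to shifts in gradient scale.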
What empirical or theoretical justification supports keeping β₂ = 0.999 despite this observed instability? And does the default value potentially yield better final model quality at the cost of temporary instability?