OLMo
Decoupled Momentum Optimization
Cleaned-up version of https://github.com/bloc97/DeMo for efficient distributed training via Decoupled Momentum Optimization (https://arxiv.org/abs/2411.19870)
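For readers skimming past the link: the core idea in the paper is that each worker keeps its momentum buffer local, extracts only the fastest-moving components (via a DCT top-k), removes those from the local buffer, and synchronizes just that compressed slice. Below is a minimal single-process sketch of that loop, simulating workers as a list of gradients; the function name `demo_step` and all hyperparameter values are illustrative, not taken from the repo.

```python
import numpy as np
from scipy.fft import dct, idct  # DCT used to find "fast-moving" components

def demo_step(params, grads, momenta, lr=1e-3, beta=0.999, k=4):
    """One DeMo-style step, sketched for a list of simulated workers.

    momenta[i] is worker i's local momentum buffer; it is never
    synchronized in full -- only the extracted top-k components are.
    """
    shared = np.zeros_like(params)
    for i, g in enumerate(grads):
        momenta[i] = beta * momenta[i] + g          # accumulate locally
        coeffs = dct(momenta[i], norm="ortho")      # frequency-domain view
        idx = np.argsort(np.abs(coeffs))[-k:]       # k largest components
        q = np.zeros_like(coeffs)
        q[idx] = coeffs[idx]
        extracted = idct(q, norm="ortho")
        momenta[i] -= extracted                     # decouple: keep the residual local
        shared += extracted                         # stands in for the all-gather
    # sign of the averaged synchronized update, as a signum-style step
    return params - lr * np.sign(shared / len(grads))
```

In a real distributed run the `shared += extracted` line would be an all-gather of the compressed coefficients rather than an in-process sum, and the DCT would be applied per tensor chunk; this sketch only shows the decoupling itself.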
Oh, I see. You put a reference in the description 🙈.
The paper says you pushed this to a 1B-parameter model on 100B tokens. Can you go further? In my experience, techniques like this tend to stop working once you scale up substantially.