OLMo
Decoupled Momentum Optimization
Cleaned-up version of https://github.com/bloc97/DeMo for efficient distributed training via Decoupled Momentum Optimization (https://arxiv.org/abs/2411.19870)
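For readers skimming past the link: the core idea in the paper is that each worker keeps its momentum buffer local, extracts only the fastest-moving components (via a DCT top-k), removes those from the local buffer, and synchronizes just that compressed slice. Below is a minimal single-process sketch of that loop, simulating workers as a list of gradients; the function name `demo_step` and all hyperparameter values are illustrative, not taken from the repo.

```python
import numpy as np
from scipy.fft import dct, idct  # DCT used to find "fast-moving" components

def demo_step(params, grads, momenta, lr=1e-3, beta=0.999, k=4):
    """One DeMo-style step, sketched for a list of simulated workers.

    momenta[i] is worker i's local momentum buffer; it is never
    synchronized in full -- only the extracted top-k components are.
    """
    shared = np.zeros_like(params)
    for i, g in enumerate(grads):
        momenta[i] = beta * momenta[i] + g          # accumulate locally
        coeffs = dct(momenta[i], norm="ortho")      # frequency-domain view
        idx = np.argsort(np.abs(coeffs))[-k:]       # k largest components
        q = np.zeros_like(coeffs)
        q[idx] = coeffs[idx]
        extracted = idct(q, norm="ortho")
        momenta[i] -= extracted                     # decouple: keep the residual local
        shared += extracted                         # stands in for the all-gather
    # sign of the averaged synchronized update, as a signum-style step
    return params - lr * np.sign(shared / len(grads))
```

In a real distributed run the `shared += extracted` line would be an all-gather of the compressed coefficients rather than an in-process sum, and the DCT would be applied per tensor chunk; this sketch only shows the decoupling itself.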
Oh, I see. You put a reference in the description 🙈.
The paper says you pushed this to a 1B-parameter model on 100B tokens. Can you go further? In my experience, techniques like this tend to stop working once you scale up substantially.