Amos optimizer support
### 🚀 The feature, motivation, and pitch
https://arxiv.org/abs/2210.11693
The Amos paper reports better multi-accelerator scaling and better performance than AdamW for autoregressive and masked language modeling. We should apply it to trlX and see whether it speeds up RLHF training.
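A rough sketch of what adoption could look like. This assumes a PyTorch port of Amos exposing the standard `torch.optim.Optimizer` interface; the reference implementation from the paper is JAX-based, so the `amos_pytorch` package, the `Amos` class, and its hyperparameter names below are all hypothetical placeholders, not a real API:

```python
import torch
import torch.nn as nn

# Hypothetical import: assumes a community PyTorch port of Amos that
# follows the standard torch.optim.Optimizer interface. The paper's
# reference implementation is in JAX.
from amos_pytorch import Amos  # placeholder, not a real package name

model = nn.Linear(768, 768)

# Drop-in replacement for the usual AdamW construction in a training
# loop; the lr value here is illustrative, not taken from the paper.
optimizer = Amos(model.parameters(), lr=1e-3)

for step in range(10):
    x = torch.randn(32, 768)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

If the port matches the `Optimizer` interface, wiring it into trlX should mostly be a matter of exposing it as a config option wherever AdamW is currently constructed.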
### Alternatives
We could alternatively stay with AdamW, which is thoroughly tried and tested.
### Additional context
We need to seriously consider the wall-time cost of Amos and whether it creates any optimization bottlenecks for us.
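One way to sanity-check the wall-time concern before committing: time `optimizer.step()` for both optimizers on the same dummy workload. A minimal sketch (real numbers would have to come from a full trlX training run):

```python
import time
import torch
import torch.nn as nn

def time_steps(optimizer, model, n_steps=100):
    """Rough average wall time per optimization step on a dummy workload."""
    x = torch.randn(64, 1024)
    start = time.perf_counter()
    for _ in range(n_steps):
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (time.perf_counter() - start) / n_steps

model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024))
adamw = torch.optim.AdamW(model.parameters(), lr=1e-4)
print(f"AdamW: {time_steps(adamw, model):.4f}s/step")
# Repeat with an Amos instance once a PyTorch implementation is in hand.
```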
@LouisCastricato I'd like to work on this. I'll try to get to it over the weekend but might need some guidance.
Let us know if you run into any issues here or on the trlx channel.
Following up
Closing issue as inactive.