OLMo
Update clipping
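The theory is that the optimizer's second-moment estimate goes to zero for some parameters. Because the update is divided by the square root of that estimate, the next update becomes very large, and that causes the loss spike.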
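To make the mechanism concrete, here is a toy simulation of one scalar parameter under an Adam-style update. It is only a sketch: the function name `adam_updates`, the betas, and the gradient sequences are illustrative assumptions, not OLMo's optimizer code or settings.

```python
import math

def adam_updates(grads, beta1=0.9, beta2=0.95, eps=1e-8):
    """Unscaled Adam update m_hat / (sqrt(v_hat) + eps) at each step, for one scalar.

    Hypothetical helper for illustration; betas and eps are assumed values.
    """
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first moment (running mean of g)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (running mean of g^2)
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        updates.append(m_hat / (math.sqrt(v_hat) + eps))
    return updates

# Gradients that keep a steady magnitude: sqrt(v) stays near |g|.
steady = adam_updates([1e-3, -1e-3] * 200)

# Gradients vanish for a long stretch, so v decays towards zero,
# and then a gradient of the usual size comes back.
quiet_then_loud = adam_updates([1e-3, -1e-3] * 100 + [0.0] * 200 + [1e-3])

print(abs(steady[-1]), abs(quiet_then_loud[-1]))
```

In this toy setup the update on the step where the gradient returns is several times larger than the steady-state update, because the denominator has collapsed while the numerator reacts immediately. Whether this is what happens near the real spike is what the extra logging is meant to confirm.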
- [x] Generate some checkpoints closer to the spike
- [x] Implement extra logging so we can make sure that this is actually what happens
- [ ] Implement update clipping with a maximum per-parameter update norm of 1. Same as Adafactor: https://github.com/google-research/t5x/blob/03dfc44be7f9a93d34c1d7fd6f896d1c364a7d4d/t5x/adafactor.py#L470C1-L476C26
- [ ] Ablate it
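
A minimal sketch of what the clipping step could look like, operating on a single parameter's update as a flat list of floats. It assumes that "norm" here means the root-mean-square of the update, which is how Adafactor's clipping threshold works as far as I understand it. The name `clip_update` and the `threshold` argument are hypothetical, and a real version would operate on tensors inside the optimizer step.

```python
import math
from typing import List

def clip_update(update: List[float], threshold: float = 1.0) -> List[float]:
    """Scale `update` down so its RMS is at most `threshold`.

    Updates whose RMS is already below the threshold are left unchanged.
    Assumption: RMS-based clipping; swap in an L2 norm if that is what is intended.
    """
    if not update:
        return []
    rms = math.sqrt(sum(u * u for u in update) / len(update))
    denom = max(1.0, rms / threshold)  # equals 1.0 when no clipping is needed
    return [u / denom for u in update]

small = clip_update([0.1, -0.2, 0.05])  # RMS below 1: returned as-is
large = clip_update([30.0, -40.0])      # RMS above 1: rescaled to RMS 1
```

Clipping is applied per parameter rather than across the whole model, so one parameter with a collapsed second moment cannot shrink everyone else's updates. For the ablation, the same function with `threshold=float("inf")` is a no-op baseline.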