Dirk Groeneveld
- [x] Mitch's (we already ran this anyway)
- [x] Our own
- [x] Kaiming: we don't understand Kaiming init very well and it's not implemented yet, so this one is...
The theory is that the second moment goes to zero, which produces a big update, which produces a loss spike.
- [x] Generate some checkpoints closer to the spike...
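The mechanism can be illustrated with a toy calculation (a sketch of the Adam update ratio, not our actual optimizer code; learning rate and bias correction omitted):

```python
import math

def adam_step(m, v, eps=1e-8):
    # Size of the parameter update Adam takes, given the first moment m
    # and second moment v (learning rate and bias correction omitted).
    return m / (math.sqrt(v) + eps)

healthy = adam_step(m=1e-3, v=1e-4)   # second moment matches the gradient scale
decayed = adam_step(m=1e-3, v=1e-12)  # second moment has decayed toward zero
```

With the same first moment, the decayed-second-moment step is several orders of magnitude larger, which is the suspected spike mechanism.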
The PaLM paper has a short section describing tweaks to the vanilla Transformer architecture. We should make sure we have all of them.
One experiment: just keep running the 7B and see whether it recovers from the spikes on its own.
https://github.com/allenai/LLM/blob/2118db56095157474fe1c69c1702db08af2d4f74/scripts/train.py#L187 I think having a checkpoint before any training happens would be quite useful.
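The idea, sketched below with `pickle` standing in for the real checkpointer in `scripts/train.py` (all names here are hypothetical, not the actual trainer API):

```python
import pickle
from pathlib import Path

def save_checkpoint(state: dict, path: Path) -> None:
    # The real trainer would use torch.save; pickle keeps this sketch self-contained.
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(state, f)

def train(state: dict, num_steps: int, ckpt_dir: Path) -> None:
    # Save step 0 *before* any optimizer updates, so we always have the
    # untrained initialization to compare later checkpoints against.
    save_checkpoint(state, ckpt_dir / "step0.pt")
    for step in range(1, num_steps + 1):
        state["step"] = step  # stand-in for the real forward/backward/update
        save_checkpoint(state, ckpt_dir / f"step{step}.pt")
```

The point is only the ordering: one checkpoint written before the training loop starts.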
We can't run in a debugger anymore.
Activation checkpointing needs to keep track of the state of the random number generator, which fails with `torch.compile()`. Rumor has it that the latest torch nightly has this fixed, so...
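What the checkpointing machinery has to do, illustrated with Python's stdlib RNG rather than torch's (this is a conceptual sketch, not the `torch.utils.checkpoint` implementation):

```python
import random

# Activation checkpointing re-runs the forward pass during backward, so any
# randomness (e.g. dropout masks) must be replayed identically. That means
# saving the RNG state before the first forward and restoring it before the
# recomputation.
rng_state = random.getstate()
first = [random.random() for _ in range(3)]   # original "forward pass"

random.setstate(rng_state)                    # restore before recomputation
replay = [random.random() for _ in range(3)]  # recomputed "forward pass"

assert first == replay  # identical random draws both times
```

It is this save/restore of RNG state that reportedly trips up `torch.compile()`.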
Maybe wandb will take care of this for us? I opened a ticket with them.
There are some checkpoints that we want to keep forever, because they are part of our output. The checkpoint saving code needs to know about those.
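One possible shape for that, as a sketch (the names, the `KEEP_FOREVER` set, and the `stepN` naming convention are all hypothetical, not our actual saving code):

```python
from pathlib import Path

# Checkpoints that are part of our output and must never be pruned.
KEEP_FOREVER = {"step1000", "step2000"}

def prune_checkpoints(ckpt_dir: Path, keep_last: int = 3) -> list[str]:
    """Delete old checkpoints, keeping the most recent `keep_last`
    and never touching the permanent ones. Returns deleted names."""
    ckpts = sorted(ckpt_dir.glob("step*"), key=lambda p: int(p.name[4:]))
    removable = [p for p in ckpts if p.name not in KEEP_FOREVER]
    deleted = []
    for p in removable[:-keep_last]:  # assumes keep_last >= 1
        p.unlink()
        deleted.append(p.name)
    return deleted
```

The key design point is that the saver consults the keep-forever list before pruning, rather than relying on someone remembering to copy those checkpoints elsewhere.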