Dirk Groeneveld
- [x] Mitch's (we already ran this anyway)
- [x] Our own
- [x] Kaiming: we don't understand Kaiming init very well and it's not implemented yet, so this one is...
The theory is that the second moment goes to zero, which produces a big update, which produces a loss spike.
- [x] Generate some checkpoints closer to the spike...
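The mechanism can be illustrated with a toy calculation (a sketch of the Adam update ratio, not our actual optimizer code; learning rate and bias correction omitted):

```python
import math

def adam_step(m, v, eps=1e-8):
    # Size of the parameter update Adam takes, given the first moment m
    # and second moment v (learning rate and bias correction omitted).
    return m / (math.sqrt(v) + eps)

healthy = adam_step(m=1e-3, v=1e-4)   # second moment matches the gradient scale
decayed = adam_step(m=1e-3, v=1e-12)  # second moment has decayed toward zero
```

With the same first moment, the decayed-second-moment step is several orders of magnitude larger, which is the suspected spike mechanism.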
The PaLM paper has a short section describing tweaks to the vanilla Transformer architecture. We should make sure we have all of them.
One experiment: just keep running the 7B and see whether it recovers from the spikes on its own.
https://github.com/allenai/LLM/blob/2118db56095157474fe1c69c1702db08af2d4f74/scripts/train.py#L187 I think having a checkpoint before any training happens would be quite useful.
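The idea, sketched below with `pickle` standing in for the real checkpointer in `scripts/train.py` (all names here are hypothetical, not the actual trainer API):

```python
import pickle
from pathlib import Path

def save_checkpoint(state: dict, path: Path) -> None:
    # The real trainer would use torch.save; pickle keeps this sketch self-contained.
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(state, f)

def train(state: dict, num_steps: int, ckpt_dir: Path) -> None:
    # Save step 0 *before* any optimizer updates, so we always have the
    # untrained initialization to compare later checkpoints against.
    save_checkpoint(state, ckpt_dir / "step0.pt")
    for step in range(1, num_steps + 1):
        state["step"] = step  # stand-in for the real forward/backward/update
        save_checkpoint(state, ckpt_dir / f"step{step}.pt")
```

The point is only the ordering: one checkpoint written before the training loop starts.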
We can't run in a debugger anymore.
Activation checkpointing needs to keep track of the state of the random number generator, which fails with `torch.compile()`. Rumor has it that the latest torch nightly has this fixed, so...
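What the checkpointing machinery has to do, illustrated with Python's stdlib RNG rather than torch's (this is a conceptual sketch, not the `torch.utils.checkpoint` implementation):

```python
import random

# Activation checkpointing re-runs the forward pass during backward, so any
# randomness (e.g. dropout masks) must be replayed identically. That means
# saving the RNG state before the first forward and restoring it before the
# recomputation.
rng_state = random.getstate()
first = [random.random() for _ in range(3)]   # original "forward pass"

random.setstate(rng_state)                    # restore before recomputation
replay = [random.random() for _ in range(3)]  # recomputed "forward pass"

assert first == replay  # identical random draws both times
```

It is this save/restore of RNG state that reportedly trips up `torch.compile()`.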
Maybe wandb will take care of this for us? I opened a ticket with them.
There are some checkpoints that we want to keep forever, because they are part of our output. The checkpoint saving code needs to know about those.
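One possible shape for that, as a sketch (the names, the `KEEP_FOREVER` set, and the `stepN` naming convention are all hypothetical, not our actual saving code):

```python
from pathlib import Path

# Checkpoints that are part of our output and must never be pruned.
KEEP_FOREVER = {"step1000", "step2000"}

def prune_checkpoints(ckpt_dir: Path, keep_last: int = 3) -> list[str]:
    """Delete old checkpoints, keeping the most recent `keep_last`
    and never touching the permanent ones. Returns deleted names."""
    ckpts = sorted(ckpt_dir.glob("step*"), key=lambda p: int(p.name[4:]))
    removable = [p for p in ckpts if p.name not in KEEP_FOREVER]
    deleted = []
    for p in removable[:-keep_last]:  # assumes keep_last >= 1
        p.unlink()
        deleted.append(p.name)
    return deleted
```

The key design point is that the saver consults the keep-forever list before pruning, rather than relying on someone remembering to copy those checkpoints elsewhere.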