nanoGPT

Pretraining Divergence

Open egoetz opened this issue 8 months ago • 3 comments

I have been trying to follow the steps listed under "reproducing GPT-2" in the README.md. Unfortunately, whenever I train the model, the run diverges. I have tried varying the learning rate and the gradient accumulation steps, but neither helped (although I did find and fix a bug in my learning-rate setup while varying those parameters). I could keep adjusting those values, but my latest runs suggest that neither of them is the issue:

(image: loss curves from the last two runs)

Here are the last two runs. The orange run decays the learning rate over 300,000 steps, while the pink run decays it over 600,000 steps. In both runs the learning rate starts at 6e-5 and decays to a minimum of 6e-6.
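For reference, here is a sketch of the warmup-plus-cosine schedule I mean (roughly the `get_lr` logic in nanoGPT's `train.py`), with the values from these runs plugged in. The `warmup_iters` value below is only a placeholder, not necessarily what I used.

```python
import math

learning_rate = 6e-5     # max learning rate for these runs
min_lr = 6e-6            # floor the schedule decays to
warmup_iters = 2000      # placeholder warmup length (assumption)
lr_decay_iters = 300000  # decay horizon (600000 for the pink run)

def get_lr(it):
    # 1) linear warmup for the first warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) after lr_decay_iters, hold at the minimum learning rate
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, cosine decay from learning_rate down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)
```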

Here are some of my hyperparameters:

- batch_size = 24
- block_size = 1024
- max_iters = 300000
- lr_decay_iters = 300000
- eval_interval = 1000
- eval_iters = 200
- log_interval = 100
- weight_decay = 5e-2

I am running this model on 4 A100 80GB GPUs.
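For completeness, this is roughly how I am passing those settings, written as a nanoGPT-style config override file; the file name and the explicit learning-rate lines are assumptions based on the values above.

```python
# config/train_gpt2_repro.py -- hypothetical override file name
# hyperparameters listed above
batch_size = 24
block_size = 1024
max_iters = 300000
lr_decay_iters = 300000
eval_interval = 1000
eval_iters = 200
log_interval = 100
weight_decay = 5e-2
# learning-rate range from the runs above
learning_rate = 6e-5
min_lr = 6e-6
```

Launched roughly as in the README, but with 4 processes instead of 8: `torchrun --standalone --nproc_per_node=4 train.py config/train_gpt2_repro.py`.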

egoetz · Jun 13, 2024