nanoGPT
Finetuning with gpt2 always gets loss nan
Hi, I tried finetuning GPT-2 with init_from = 'gpt2', but the loss is nan from iter 1 onward, and the out-shakespeare folder is still empty after the iterations finish.
python train.py config/finetune_shakespeare.py
Overriding config with config/finetune_shakespeare.py:
import time
out_dir = 'out-shakespeare'
eval_interval = 5
eval_iters = 40
wandb_log = False # feel free to turn on
wandb_project = 'shakespeare'
wandb_run_name = 'ft-' + str(time.time())
dataset = 'shakespeare'
#init_from = 'gpt2-xl' # this is the largest GPT-2 model
init_from = 'gpt2' # this is the smallest GPT-2 model (124M params)
# only save checkpoints if the validation loss improves
always_save_checkpoint = False
# the number of examples per iter:
# 1 batch_size * 32 grad_accum * 1024 tokens = 32,768 tokens/iter
# shakespeare has 301,966 tokens, so 1 epoch ~= 9.2 iters
batch_size = 1
gradient_accumulation_steps = 32
max_iters = 20
# finetune at constant LR
learning_rate = 3e-5
decay_lr = False
Initializing from OpenAI GPT-2 weights: gpt2
loading weights from pretrained gpt: gpt2
forcing vocab_size=50257, block_size=1024, bias=True
overriding dropout rate to 0.0
number of parameters: 123.65M
using fused AdamW: True
compiling the model... (takes a ~minute)
step 0: train loss 4.1872, val loss 4.0326
iter 0: loss 4.8131, time 36410.71ms, mfu -100.00%
iter 1: loss nan, time 13049.51ms, mfu -100.00%
iter 2: loss nan, time 13048.80ms, mfu -100.00%
iter 3: loss nan, time 13049.49ms, mfu -100.00%
iter 4: loss nan, time 13074.46ms, mfu -100.00%
step 5: train loss nan, val loss nan
iter 5: loss nan, time 13812.17ms, mfu 5.20%
iter 6: loss nan, time 13049.36ms, mfu 5.23%
iter 7: loss nan, time 13049.08ms, mfu 5.26%
iter 8: loss nan, time 13050.76ms, mfu 5.28%
iter 9: loss nan, time 13053.72ms, mfu 5.31%
step 10: train loss nan, val loss nan
iter 10: loss nan, time 13810.84ms, mfu 5.30%
iter 11: loss nan, time 13048.59ms, mfu 5.32%
iter 12: loss nan, time 13061.67ms, mfu 5.34%
iter 13: loss nan, time 13072.87ms, mfu 5.35%
iter 14: loss nan, time 13073.57ms, mfu 5.37%
step 15: train loss nan, val loss nan
iter 15: loss nan, time 13812.62ms, mfu 5.35%
iter 16: loss nan, time 13049.83ms, mfu 5.37%
iter 17: loss nan, time 13049.94ms, mfu 5.38%
iter 18: loss nan, time 13049.49ms, mfu 5.39%
iter 19: loss nan, time 13049.14ms, mfu 5.40%
step 20: train loss nan, val loss nan
iter 20: loss nan, time 13812.77ms, mfu 5.38%
It is actually the PyTorch 2.0 version. For testing I just ran it with version 1.13.1 (turning model compile off by adding compile=False to the finetune config).
It works. So what's wrong with PyTorch 2.0 and Shakespeare finetuning?
I added compile = False to finetune_shakespeare.py, but the loss is still nan.
@karpathy tagging Andrej so he is aware of this issue.
Sorry, I was not clear enough in my answer: I had to install PyTorch 1.13.1 AND set compile to False in order to make it work.
I installed this version
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
and then set:
compile=False
dtype='float16'
from='gpt'
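For reference, a minimal sketch of what those overrides could look like at the end of config/finetune_shakespeare.py (this assumes nanoGPT's usual config keys; from='gpt' above presumably refers to the init_from option):
compile = False      # disable torch.compile
dtype = 'float16'    # fp16 autocast with a GradScaler instead of bfloat16
init_from = 'gpt2'   # load the pretrained 124M-parameter GPT-2 checkpoint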
There might be a couple of reasons for that:
- the learning rate is too high - try 1e-5 or 3e-6
- nan (or, in general, invalid) values in your input
- an overflow or underflow error - try using LR decay and/or skipping the optimizer step when the loss is nan (with the next sample, the loss might be fine); see the sketch after the link below
Here are the PyTorch docs for reference: https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html#loss-is-inf-nan
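A rough sketch of what skipping the update on a non-finite loss could look like in a PyTorch AMP loop (model, optimizer, scaler, X and Y are illustrative stand-ins for the variables in the training loop, not nanoGPT's exact code):
import torch
logits, loss = model(X, Y)              # forward pass; the model returns (logits, loss)
if torch.isfinite(loss):
    scaler.scale(loss).backward()       # scaled backward pass under fp16 autocast
    scaler.step(optimizer)              # unscale grads and step (the scaler itself skips if grads are inf/nan)
    scaler.update()
else:
    print('non-finite loss, skipping this step')
optimizer.zero_grad(set_to_none=True)   # clear gradients either way before the next batch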
just added a PR for skipping steps when loss is nan
I had the same problem as @judyhappy. After trial and error I used --dtype='float32' with a learning rate of 1e-5, and it seems to be working fine, though training is quite slow.
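For anyone else hitting this, those overrides can also be passed on the command line (assuming nanoGPT's --key=value configurator), for example:
python train.py config/finetune_shakespeare.py --dtype=float32 --learning_rate=1e-5 --compile=False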
But note that downgrading means giving up Flash Attention: it can only be used with torch 2.0, since Flash Attention requires PyTorch >= 2.0.
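A quick way to check whether the fused kernel is available in the installed torch (nanoGPT falls back to a slower manual attention implementation when it is not; the snippet below is just a sketch of that check):
import torch
# True on PyTorch >= 2.0, False on 1.13.1, where the slow attention path is used instead
print(hasattr(torch.nn.functional, 'scaled_dot_product_attention'))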