nanoGPT
Finetuning with gpt2 always gets loss nan
Hi, I tried finetuning GPT-2 with init_from = 'gpt2', but the loss is nan from iter 1 onward, and the out-shakespeare folder is still empty after the iterations finish.
python train.py config/finetune_shakespeare.py
Overriding config with config/finetune_shakespeare.py:
import time
out_dir = 'out-shakespeare'
eval_interval = 5
eval_iters = 40
wandb_log = False # feel free to turn on
wandb_project = 'shakespeare'
wandb_run_name = 'ft-' + str(time.time())
dataset = 'shakespeare'
#init_from = 'gpt2-xl' # this is the largest GPT-2 model
init_from = 'gpt2' # this is the smallest GPT-2 model (124M params)
# only save checkpoints if the validation loss improves
always_save_checkpoint = False
# the number of examples per iter:
# 1 batch_size * 32 grad_accum * 1024 tokens = 32,768 tokens/iter
# shakespeare has 301,966 tokens, so 1 epoch ~= 9.2 iters
batch_size = 1
gradient_accumulation_steps = 32
max_iters = 20
# finetune at constant LR
learning_rate = 3e-5
decay_lr = False
Initializing from OpenAI GPT-2 weights: gpt2
loading weights from pretrained gpt: gpt2
forcing vocab_size=50257, block_size=1024, bias=True
overriding dropout rate to 0.0
number of parameters: 123.65M
using fused AdamW: True
compiling the model... (takes a ~minute)
step 0: train loss 4.1872, val loss 4.0326
iter 0: loss 4.8131, time 36410.71ms, mfu -100.00%
iter 1: loss nan, time 13049.51ms, mfu -100.00%
iter 2: loss nan, time 13048.80ms, mfu -100.00%
iter 3: loss nan, time 13049.49ms, mfu -100.00%
iter 4: loss nan, time 13074.46ms, mfu -100.00%
step 5: train loss nan, val loss nan
iter 5: loss nan, time 13812.17ms, mfu 5.20%
iter 6: loss nan, time 13049.36ms, mfu 5.23%
iter 7: loss nan, time 13049.08ms, mfu 5.26%
iter 8: loss nan, time 13050.76ms, mfu 5.28%
iter 9: loss nan, time 13053.72ms, mfu 5.31%
step 10: train loss nan, val loss nan
iter 10: loss nan, time 13810.84ms, mfu 5.30%
iter 11: loss nan, time 13048.59ms, mfu 5.32%
iter 12: loss nan, time 13061.67ms, mfu 5.34%
iter 13: loss nan, time 13072.87ms, mfu 5.35%
iter 14: loss nan, time 13073.57ms, mfu 5.37%
step 15: train loss nan, val loss nan
iter 15: loss nan, time 13812.62ms, mfu 5.35%
iter 16: loss nan, time 13049.83ms, mfu 5.37%
iter 17: loss nan, time 13049.94ms, mfu 5.38%
iter 18: loss nan, time 13049.49ms, mfu 5.39%
iter 19: loss nan, time 13049.14ms, mfu 5.40%
step 20: train loss nan, val loss nan
iter 20: loss nan, time 13812.77ms, mfu 5.38%
It is actually the PyTorch 2.0 version. For testing I just ran it with version 1.13.1 (turning model compile off by adding compile=False to the finetune config).
It works. So what's wrong with PyTorch 2.0 and Shakespeare finetuning?
I added compile = False to finetune_shakespeare.py, but the loss is still nan.
@karpathy tagging Andrej so he is aware of this issue.
Sorry, I was not clear enough in my answer: I had to install PyTorch 1.13.1 AND set compile to False in order to make it work.
I installed this version
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
and then set:
compile=False
dtype='float16'
from='gpt'
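For reference, a minimal sketch of what those overrides could look like at the end of config/finetune_shakespeare.py (this assumes nanoGPT's usual config keys; from='gpt' above presumably refers to the init_from option):
compile = False      # disable torch.compile
dtype = 'float16'    # fp16 autocast with a GradScaler instead of bfloat16
init_from = 'gpt2'   # load the pretrained 124M-parameter GPT-2 checkpoint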
There might be a couple of reasons for that:
- the learning rate is too high - try 1e-5 or 3e-6
- nan (or, in general, invalid) values in your input
- an overflow or underflow error - try using LR decay and/or skipping the optimizer step when the loss is nan (with the next sample, the loss might be fine); see the sketch after the link below
Here are the PyTorch docs for reference: https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html#loss-is-inf-nan
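A rough sketch of what skipping the update on a non-finite loss could look like in a PyTorch AMP loop (model, optimizer, scaler, X and Y are illustrative stand-ins for the variables in the training loop, not nanoGPT's exact code):
import torch
logits, loss = model(X, Y)              # forward pass; the model returns (logits, loss)
if torch.isfinite(loss):
    scaler.scale(loss).backward()       # scaled backward pass under fp16 autocast
    scaler.step(optimizer)              # unscale grads and step (the scaler itself skips if grads are inf/nan)
    scaler.update()
else:
    print('non-finite loss, skipping this step')
optimizer.zero_grad(set_to_none=True)   # clear gradients either way before the next batch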
just added a PR for skipping steps when loss is nan
I had the same problem as @judyhappy. After trial and error I used --dtype='float32' with a learning rate of 1e-5, and it seems to be working fine, though training is quite slow.
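For anyone else hitting this, those overrides can also be passed on the command line (assuming nanoGPT's --key=value configurator), for example:
python train.py config/finetune_shakespeare.py --dtype=float32 --learning_rate=1e-5 --compile=False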
But note that downgrading means giving up Flash Attention: it can only be used with torch 2.0, since Flash Attention requires PyTorch >= 2.0.
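A quick way to check whether the fused kernel is available in the installed torch (nanoGPT falls back to a slower manual attention implementation when it is not; the snippet below is just a sketch of that check):
import torch
# True on PyTorch >= 2.0, False on 1.13.1, where the slow attention path is used instead
print(hasattr(torch.nn.functional, 'scaled_dot_product_attention'))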