
Wrong epoch number on last line

Open · rasbt opened this issue on Mar 13, 2024 · 2 comments

The epoch counter is incremented before the final log line is printed, so the last line before training finishes reports the wrong epoch. This affects all finetuning scripts:

```
Epoch 4 | iter 961 step 961 | loss train: 1.062, val: 1.057 | iter time: 529.46 ms (step)
Epoch 4 | iter 962 step 962 | loss train: 0.937, val: 1.057 | iter time: 503.53 ms (step)
Epoch 4 | iter 963 step 963 | loss train: 0.971, val: 1.057 | iter time: 522.10 ms (step)
Epoch 4 | iter 964 step 964 | loss train: 0.902, val: 1.057 | iter time: 115.27 ms (step)
Epoch 5 | iter 965 step 965 | loss train: 1.182, val: 1.057 | iter time: 743.31 ms (step)
Training time: 583.36s
Memory used: 14.49 GB
Saving LoRA weights to 'out/finetune/lora-tiny-llama-1.1b/final/lit_model.pth.lora'
```
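For illustration, here is a minimal sketch of the pattern (hypothetical names and numbers, not litgpt's actual code) showing how bumping the epoch counter before logging produces the off-by-one on the final line, and one way to avoid it:

```python
# Hypothetical sketch of the off-by-one pattern; not litgpt's actual code.
num_epochs = 4
iters_per_epoch = 3

# Buggy variant: the epoch counter is advanced before the epoch's last
# log line is printed, so the final iteration reports the *next* epoch.
epoch = 1
for iter_num in range(1, num_epochs * iters_per_epoch + 1):
    if iter_num % iters_per_epoch == 0:
        epoch += 1  # advanced too early
    print(f"Epoch {epoch} | iter {iter_num}")  # last line: "Epoch 5 | iter 12"

# Fixed variant: derive the epoch from the iteration count instead,
# so every log line reflects the epoch actually being trained.
for iter_num in range(1, num_epochs * iters_per_epoch + 1):
    epoch = (iter_num - 1) // iters_per_epoch + 1
    print(f"Epoch {epoch} | iter {iter_num}")  # last line: "Epoch 4 | iter 12"
```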

rasbt · Mar 13, 2024

Also, for how many iterations does each epoch run?

RidhiChhajer · Jul 15, 2024

Good question. The number of iterations depends on the batch size: one epoch means one full pass over the training set, so a smaller batch size means more iterations per epoch.
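As a concrete sketch (the numbers are hypothetical, not from the run above), the relationship is simply:

```python
import math

# Hypothetical numbers, purely for illustration.
num_samples = 51_000  # training examples in the dataset
batch_size = 128

# One epoch = one full pass over the dataset:
iters_per_epoch = math.ceil(num_samples / batch_size)
print(iters_per_epoch)  # 399; halving batch_size roughly doubles this
```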

rasbt · Jul 15, 2024