LLM finetuning is overfitting?
So far, all my attempts with different models (BLOOM, GPT), sizes, the Accelerate framework, and datasets have led to the same issue: the evaluation loss keeps increasing. Please see my log (DeepSpeed).
Hard to really tell without specific dataset info, training procedure, and the model parameter count BUT:
I can't speak for your other attempts but this picture doesn't seem unusual. The eval loss decreases until epoch=0.94 but increases at epoch=1.25 and onwards. That implies that training is good for one epoch. Depending on the size of the dataset, models can easily start overfitting after one finetuning epoch (since it's just repeating the data). I assume this is finetuning, not pretraining?
Finetuning with adapters may work better.
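For reference, here is a minimal sketch of what adapter (LoRA) finetuning with the PEFT library can look like; the model name, rank, and other hyperparameters are illustrative assumptions, not values taken from your run:

```python
# Sketch of LoRA finetuning with PEFT: only the low-rank adapter weights are trained,
# the base model stays frozen, which usually makes overfitting on small datasets less severe.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model_name = "bigscience/bloom-560m"  # assumed small BLOOM checkpoint, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,              # adapter rank (placeholder value)
    lora_alpha=16,
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the total parameters
```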
Hi, @paulcx thanks for raising an issue!
This is probably a question best placed in our forums. We try to reserve the github issues for feature requests and bug reports. If you suspect that the issue is coming from the library itself, could you follow the issue template and give more information about what is being run (environment and reproducible code snippet) so that we can best help you?
That's right, I'm finetuning. I know that pretraining and LoRA finetuning work as expected. I just wonder if anyone else has the same issue. Does that mean anything beyond one epoch is overfitting? I've seen a lot of open source projects finetune for 3 or 4 epochs with no explanation.
Yes, one epoch seems to be enough for this run. Going any further would likely require hyperparameter tuning and/or a larger dataset. Some of my models also begin overfitting after one finetuning epoch (around ~900k samples in my dataset - I don't know how large your dataset is).
Other projects may be using a different/larger dataset? Even if not, that's not too uncommon. They can finetune for a few more epochs than needed and then evaluate their checkpoints on a test set. The best performing checkpoint is then selected (which could be from a few epochs prior to the latest).
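A sketch of that workflow with the Trainer, assuming `model`, `train_dataset`, and `eval_dataset` are your own objects and the argument values are placeholders: train for a few extra epochs, evaluate at each one, and reload the checkpoint with the lowest eval loss at the end.

```python
# Checkpoint selection sketch: evaluate and save every epoch, keep the checkpoint
# with the best eval loss, and optionally stop early once eval loss stops improving.
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="finetune-out",
    num_train_epochs=4,                 # train longer than needed ...
    evaluation_strategy="epoch",        # ... but evaluate at every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,        # reload the best checkpoint, not the last one
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,                        # e.g. the (PEFT-wrapped) model from the sketch above
    args=training_args,
    train_dataset=train_dataset,        # placeholder: your tokenized datasets
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],
)
trainer.train()
```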
My dataset is only about 90K samples. The one-epoch 'theory' is quite interesting. It seems that people don't talk about this issue and just ignore overfitting.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.