
LLM finetuning is overfitting?

Open paulcx opened this issue 2 years ago • 5 comments

So far, all my attempts with different models (BLOOM, GPT), model sizes, the accelerate framework, and datasets have led to the same issue: the evaluation loss keeps increasing. Please see my DeepSpeed training log below. [screenshot of training log]

paulcx avatar Apr 22 '23 00:04 paulcx

Hard to really tell without specific dataset info, training procedure, and the model parameter count BUT:

I can't speak for your other attempts but this picture doesn't seem unusual. The eval loss decreases until epoch=0.94 but increases at epoch=1.25 and onwards. That implies that training is good for one epoch. Depending on the size of the dataset, models can easily start overfitting after one finetuning epoch (since it's just repeating the data). I assume this is finetuning, not pretraining?

Finetuning with adapters may work better.
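
For reference, here's a minimal sketch of what adapter-style finetuning could look like with the PEFT library; the model name, rank, and target_modules below are illustrative assumptions, not something taken from this thread:

```python
# A minimal sketch of adapter (LoRA) finetuning with the PEFT library.
# Not code from this thread; the model id, rank, and target_modules
# are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

# Freeze the base model and train only small low-rank adapter matrices;
# far fewer trainable parameters usually means slower overfitting.
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                # adapter rank (assumed value)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"], # BLOOM attention projection
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()      # e.g. well under 1% of weights are trainable
# `model` can now be passed to the usual Trainer / accelerate training loop.
```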

SmartWashingMachine avatar Apr 23 '23 00:04 SmartWashingMachine

Hi, @paulcx thanks for raising an issue!

This is probably a question best placed in our forums. We try to reserve the GitHub issues for feature requests and bug reports. If you suspect that the issue is coming from the library itself, could you follow the issue template and give more information about what is being run (environment and reproducible code snippet) so that we can best help you?

amyeroberts avatar Apr 24 '23 10:04 amyeroberts

@SmartWashingMachine That's right, I'm trying finetuning. I know pretraining and LoRA finetuning work as expected; I just wonder if anyone has run into the same issue. Does that mean one epoch is already enough to start overfitting? I've seen a lot of open-source projects that finetune for 3 or 4 epochs with no explanation.

paulcx avatar Apr 24 '23 12:04 paulcx

Yes, one epoch seems to be enough for this run. Going any further would likely require hyperparameter tuning and/or a larger dataset. Some of my models also begin overfitting after one finetuning epoch (my dataset has ~900k samples; I don't know how large yours is).

Other projects may be using a different or larger dataset. Even if not, that's not too uncommon: they can finetune for a few more epochs than needed, evaluate each checkpoint on a test set, and then select the best-performing checkpoint (which may come from a few epochs before the last one).
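
Purely as an illustration (not code from this thread), that checkpoint-selection pattern could look roughly like this with the transformers Trainer; the model/dataset objects and hyperparameter values below are placeholders:

```python
# A minimal sketch of "train a few extra epochs, keep the best checkpoint"
# with the transformers Trainer. The model and dataset objects are assumed
# to exist elsewhere; hyperparameter values are illustrative only.
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=4,                # train longer than is probably needed
    evaluation_strategy="epoch",       # evaluate at the end of every epoch
    save_strategy="epoch",             # keep one checkpoint per epoch
    load_best_model_at_end=True,       # reload the best checkpoint afterwards
    metric_for_best_model="eval_loss",
    greater_is_better=False,           # lower eval loss = better
)

trainer = Trainer(
    model=model,                       # assumed: model defined elsewhere
    args=args,
    train_dataset=train_ds,            # assumed: dataset objects exist
    eval_dataset=eval_ds,
    # optionally stop as soon as eval loss stops improving:
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],
)
trainer.train()
```

With load_best_model_at_end=True and metric_for_best_model="eval_loss", the Trainer reloads whichever epoch checkpoint had the lowest eval loss, even if that was epoch 1.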

SmartWashingMachine avatar Apr 25 '23 00:04 SmartWashingMachine

My dataset is only about 90K samples. The one-epoch 'theory' is quite interesting. It seems that people don't really talk about this issue and just ignore overfitting.

paulcx avatar Apr 25 '23 12:04 paulcx

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 22 '23 15:05 github-actions[bot]