
Training Loss vs Evaluation Loss during Fine-Tuning StarCoder

Open ruchaa0112 opened this issue 1 year ago • 10 comments

Hi @ArmelRandy and @loubnabnl

I am fine-tuning StarCoder on my custom dataset and was monitoring the training and validation loss. The training loss decreases, but the eval loss seems to increase. Can you please help me with this? This is how the training loss looks: image

This is how the eval loss looks: image

Please let me know if any more information is needed from my end.

ruchaa0112 avatar Jul 10 '23 19:07 ruchaa0112

Hi @ruchaa0112, can you share the command you used to launch the training, and the distribution of your dataset in terms of size (training set size vs validation set size)? It is likely that you are doing too many steps, which can lead to your model overfitting.

ArmelRandy avatar Jul 11 '23 00:07 ArmelRandy

Hi @ArmelRandy , thank you for your response !

Sure, here are the details. This is the command I used to fine-tune the model - CUDA_VISIBLE_DEVICES=1 nohup python -u finetune/finetune.py --model_path="bigcode/starcoder" --dataset_name="pymapdl_ft" --subset="/raid/ansysai/ruchaa/projects/pymapdlAI/starcoder/pymapdl_ft_data.csv" --split=92 --size_valid_set 10000 --seq_length 3072 --max_steps 1000 --batch_size 1 --input_column_name="Prompt" --output_column_name="Completion" --gradient_accumulation_steps 16 --learning_rate 1e-4 --lr_scheduler_type="cosine" --num_warmup_steps 100 --weight_decay 0.05 --output_dir="./checkpoints"

Size of the train set: 92
Size of the validation set: 15

Highest number of tokens in a prompt-completion pair: 3138
Lowest number of tokens in a prompt-completion pair: 79
Character per token ratio: 2.577632117984433
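
(For reference, statistics like these can be reproduced with a short script along the following lines. This is a minimal sketch, not the actual code from finetune.py; the CSV path and the way the prompt and completion are joined are assumptions.)

```python
# Minimal sketch (not the actual finetune.py code) for reproducing the
# dataset statistics above: token counts per prompt-completion pair and
# the character-per-token ratio.
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")  # requires access to the gated model
df = pd.read_csv("pymapdl_ft_data.csv")  # assumed local copy of the dataset

token_counts, char_counts = [], []
for _, row in df.iterrows():
    text = f"{row['Prompt']}\n{row['Completion']}"  # how the pair is joined is an assumption
    ids = tokenizer(text)["input_ids"]
    token_counts.append(len(ids))
    char_counts.append(len(text))

print("Highest tokens in a pair:", max(token_counts))
print("Lowest tokens in a pair:", min(token_counts))
print("Characters per token:", sum(char_counts) / sum(token_counts))
```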

ruchaa0112 avatar Jul 11 '23 00:07 ruchaa0112

Okay, it looks like you are using a small dataset. Keep in mind that in the fine-tuning script we concatenate all the inputs (here instruction + output) into a single stream that we divide into blocks of size seq_length. This can reduce the number of actual examples in your dataset. Using batch_size=1 and gradient_accumulation_steps=16 is probably too much for your dataset because it results in a large effective batch size. What I would advise you to do for now is to reduce the values of some arguments, namely --gradient_accumulation_steps 2, --log_freq 1, --eval_freq 1, --save_freq 5 and --num_warmup_steps 3. Long story short, you are probably doing too many steps.
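
(To make the two points above concrete, here is an illustrative sketch of the packing and of the effective batch size. It is not the actual implementation in finetune.py, just the idea.)

```python
# Illustrative sketch only; not the actual finetune.py implementation.

def pack_into_blocks(token_ids_per_example, seq_length):
    """Concatenate all tokenized examples into one stream and cut it into
    fixed-size blocks of seq_length tokens (the 'packing' described above)."""
    stream = [tok for ids in token_ids_per_example for tok in ids]
    return [stream[i:i + seq_length]
            for i in range(0, len(stream) - seq_length + 1, seq_length)]

# With ~92 short examples, packing into big blocks leaves very few actual
# training sequences, so each optimizer step revisits the same data.
dummy = [[1] * 200 for _ in range(92)]             # 92 short tokenized examples (assumed ~200 tokens each)
blocks = pack_into_blocks(dummy, seq_length=3072)
print(len(blocks), "packed training sequences")    # far fewer than 92

# What the optimizer really steps on is the effective batch size:
batch_size = 1
gradient_accumulation_steps = 16
effective_batch_size = batch_size * gradient_accumulation_steps
print("Effective batch size:", effective_batch_size)  # 16 sequences per update
```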

ArmelRandy avatar Jul 11 '23 00:07 ArmelRandy

Thank you so much for your prompt response, @ArmelRandy. Let me try with the settings you mentioned. :)

I also wanted to know approximately how many examples I should target for the training and test data. Also, is it advisable to use the inputs as prompts and the labels as completions from the code, instead of concatenating them into instruction + output?

ruchaa0112 avatar Jul 11 '23 01:07 ruchaa0112

I don't really know if you should target a specific number of examples for the training and test data. It is certainly better if your dataset is reasonably big, but I can't give a precise figure. The point is to adapt the parameters to the size of your dataset, that's it.

For your question about using inputs as prompts and labels as completions from the code: that would amount to building a seq2seq model, right? It is likely that a decoder-based model would have trouble with such a framework.

The whole point of concatenating instruction + output is to mimic the setting in which StarCoder was trained, which is next-token prediction.
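
(As an illustration of that concatenation, the text fed to the model for each pair typically looks something like the sketch below; the "Question:"/"Answer:" template is an assumption, the actual template in finetune.py may differ.)

```python
# Sketch of prompt/completion concatenation for causal (next-token) training.
# The exact template in finetune.py may differ; "Question:"/"Answer:" is illustrative.

def prepare_sample_text(prompt: str, completion: str) -> str:
    """Turn a prompt/completion pair into a single plain string. The model is
    trained with ordinary next-token prediction over the whole string, rather
    than with a separate encoder input and decoder label (seq2seq)."""
    return f"Question: {prompt}\n\nAnswer: {completion}"

print(prepare_sample_text("How do I reverse a list in Python?", "my_list[::-1]"))
```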

ArmelRandy avatar Jul 11 '23 01:07 ArmelRandy

Got it, thank you so much for your response. I really appreciate your input.

Also, if I convert the same data to ChatML format and fine-tune StarCoder with it, will that work? Or would you recommend first fine-tuning the model on the concatenated prompt + completion data, and then using this model with OpenAssistant data to make it a chatty model?

ruchaa0112 avatar Jul 11 '23 01:07 ruchaa0112

If you turn your dataset into a "chat dataset", then yes, you should be able to fine-tune StarCoder for chat purposes with it. I think both approaches are worth trying; they will probably yield interesting outputs. Another thing you can do is mix your dataset (rephrased as chat) with oasst (be mindful of the ratio) and directly fine-tune StarCoder for chat.
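
(As a sketch of what that could look like: each pair is rephrased into a ChatML-style conversation and interleaved with oasst samples at a chosen ratio. The <|im_start|>/<|im_end|> markers follow the common ChatML convention and are an assumption here; use whatever chat template and special tokens your fine-tuning setup expects.)

```python
import random

# Sketch only: the ChatML markers and the mixing helper are illustrative,
# not part of finetune.py or any specific chat fine-tuning script.

def to_chatml(prompt: str, completion: str) -> str:
    """Rephrase a prompt/completion pair as a ChatML-style two-turn chat."""
    return (
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        f"<|im_start|>assistant\n{completion}<|im_end|>\n"
    )

def mix_datasets(custom_chats, oasst_chats, n_custom, seed=0):
    """Interleave the rephrased custom chats with oasst chats, controlling the
    ratio via n_custom (the 'be mindful of the ratio' point above)."""
    rng = random.Random(seed)
    mixed = rng.sample(list(custom_chats), min(n_custom, len(custom_chats))) + list(oasst_chats)
    rng.shuffle(mixed)
    return mixed
```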

ArmelRandy avatar Jul 11 '23 01:07 ArmelRandy

Thanks, that is a really interesting suggestion! Thank you so much for all your help; this has really helped me clear my doubts.

ruchaa0112 avatar Jul 11 '23 01:07 ruchaa0112

@ArmelRandy - Hi Armel, as per your suggestion I modified the parameters for the Trainer class, and the command now looks like this - CUDA_VISIBLE_DEVICES=1 nohup python -u finetune/finetune.py --model_path="bigcode/starcoder" --dataset_name="pymapdl_ft" --subset="/raid/ansysai/ruchaa/projects/pymapdlAI/starcoder/pymapdl_ft_data.csv" --split=92 --size_valid_set 10000 --seq_length 2048 --max_steps 1000 --batch_size 1 --input_column_name="Prompt" --output_column_name="Completion" --gradient_accumulation_steps 2 --learning_rate 1e-4 --log_freq 1 --eval_freq 1 --save_freq 5 --lr_scheduler_type="cosine" --num_warmup_steps 3 --weight_decay 0.05 --output_dir="./checkpoints"
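
(A back-of-the-envelope check on how many passes over the data 1000 steps implies with these settings; the average tokens-per-pair figure below is an assumption, since the thread only reports the minimum and maximum.)

```python
# Rough estimate only; not part of finetune.py.
seq_length = 2048
batch_size = 1
gradient_accumulation_steps = 2
max_steps = 1000

tokens_per_update = seq_length * batch_size * gradient_accumulation_steps  # 4096 tokens per optimizer step
total_tokens_seen = tokens_per_update * max_steps                          # ~4.1M tokens over the run

# Assumption: ~1000 tokens per pair on average (the thread reports min 79 / max 3138).
approx_dataset_tokens = 92 * 1000

print("Approximate passes over the training data:",
      round(total_tokens_seen / approx_dataset_tokens))  # dozens of epochs -> easy to overfit
```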

The eval loss seems to decrease now, but the train loss is increasing. Training Loss: image

Evaluation Loss: image

ruchaa0112 avatar Jul 11 '23 16:07 ruchaa0112

Hi. The training loss seems to increase for 5 consecutive steps, but that is not enough to conclude anything; it could just be oscillation. You should see how the training loss behaves over more steps.
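
(One simple way to tell oscillation from a real upward trend is to smooth the logged training loss before reading it, e.g. with an exponential moving average. The sketch below is generic and unrelated to finetune.py; the loss values are made up.)

```python
def ema(losses, alpha=0.1):
    """Exponential moving average of logged loss values; a smoothed curve makes
    it easier to tell a genuine upward trend from step-to-step oscillation."""
    smoothed, current = [], None
    for value in losses:
        current = value if current is None else alpha * value + (1 - alpha) * current
        smoothed.append(current)
    return smoothed

# Example with noisy per-step training losses (made-up numbers):
print(ema([1.9, 2.1, 1.8, 2.2, 2.0, 1.7, 1.9, 1.6]))
```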

ArmelRandy avatar Jul 17 '23 07:07 ArmelRandy