
[Question] Overfitting in my finetune experiment using my custom data

Open Pro-flynn opened this issue 1 year ago • 32 comments

Question

After finetuning on my custom data, the finetuned LLaVA model is overfitting. In my experiments, I followed your instructions (cited in https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md).

  1. Convert my data to the required format (a conversion sketch follows this list), as follows:
    {
        "id": "mamian_fengwo_000252",
        "image": "mamian_fengwo_000252.jpg",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nWhere is honeycombing on pillars in the image? answer in [[x0,y0,x1,y1]] format."
            },
            {
                "from": "gpt",
                "value": "[[0.7677605, 0.815028, 0.8906875, 0.92288], [0, 0.675963, 0.03476, 0.890241], [0.664312, 0.7921855, 0.7664839999999999, 0.9241485], [0.1377295, 0.7824074999999999, 0.2766145, 0.9952505]]"
            }
        ]
    },
  2. Use the official script (cited in https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/finetune_task_lora.sh), as follows:
deepspeed llava/train/train_mem.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path liuhaotian/llava-v1.5-13b \
    --version v1 \
    --data_path ./playground/data/llava_lora_finetune_mamianfengwo_floatanno_xywh_train.json \
    --image_folder ./playground/data/images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/llava-v1.5-13b-task-lora_100epoch_floatanno_xywh_train.json \
    --num_train_epochs 100 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 10 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True

We found the finetuned LLaVA model underfits when setting the epochs to 1-10, so we set the epochs to 50-100; however, the finetuned model then overfits.

  3. We find that the train loss is ~0 by the end of training, yet the performance on the test data is very poor.
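
For reference (step 1 above), a minimal sketch of how I build this JSON, assuming a hypothetical `raw_annotations` list that already holds the image filename, the question, and normalized [x0, y0, x1, y1] boxes:

```python
import json

# Hypothetical raw annotation structure: one record per image with
# normalized [x0, y0, x1, y1] boxes for the defect being asked about.
raw_annotations = [
    {
        "image": "mamian_fengwo_000252.jpg",
        "question": "Where is honeycombing on pillars in the image? answer in [[x0,y0,x1,y1]] format.",
        "boxes": [[0.768, 0.815, 0.891, 0.923], [0.138, 0.782, 0.277, 0.995]],
    },
]

records = []
for ann in raw_annotations:
    records.append({
        "id": ann["image"].rsplit(".", 1)[0],
        "image": ann["image"],
        "conversations": [
            {"from": "human", "value": "<image>\n" + ann["question"]},
            # Serialize the box list so the target string looks like the example above.
            {"from": "gpt", "value": json.dumps(ann["boxes"])},
        ],
    })

with open("llava_lora_finetune_train.json", "w") as f:
    json.dump(records, f, indent=4)
```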

Pro-flynn avatar Nov 23 '23 13:11 Pro-flynn

How do you think I should adjust my training strategy?

Pro-flynn avatar Nov 23 '23 13:11 Pro-flynn

I think the epoch number is too big. My model is also a little bit overfit after a full finetune on 20k data for 2 epochs (batch size 4), and the loss was 0.67.

Linziyang1999 avatar Nov 24 '23 02:11 Linziyang1999

What's the current inference performance? Do you think LLaVA is suitable for this kind of object detection task?

FHL1998 avatar Nov 24 '23 02:11 FHL1998

Maybe you can check OCR LLaVA; someone already did it, and they used an OCR dataset in both pretraining and finetuning.

Linziyang1999 avatar Nov 25 '23 03:11 Linziyang1999

LLMs have shown outstanding performance in OCR. I think LLaVA can manage it.

Linziyang1999 avatar Nov 25 '23 03:11 Linziyang1999

Check this: https://llavar.github.io/

Linziyang1999 avatar Nov 25 '23 03:11 Linziyang1999

I've also adopted a similar approach for training my model. However, I find myself perplexed upon reviewing the training statistics.

wandb: Run history:
wandb:                    train/epoch ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:              train/global_step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:            train/learning_rate ▄███████▇▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁
wandb:                     train/loss █▆▇▆▆▆▆▅▆▅▄▄▄▃▂▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:               train/total_flos ▁
wandb:               train/train_loss ▁
wandb:            train/train_runtime ▁
wandb: train/train_samples_per_second ▁
wandb:   train/train_steps_per_second ▁
wandb: 
wandb: Run summary:
wandb:                    train/epoch 20.0
wandb:              train/global_step 360
wandb:            train/learning_rate 0.0
wandb:                     train/loss 0.0072
wandb:               train/total_flos 1967070044160.0
wandb:               train/train_loss 0.4161
wandb:            train/train_runtime 848.715
wandb: train/train_samples_per_second 3.323
wandb:   train/train_steps_per_second 0.424

I'm puzzled about the distinction between train/train_loss with a value of 0.4161 and train/loss with a value of 0.0072. Could someone please clarify this for me?

Nomiluks avatar Nov 27 '23 08:11 Nomiluks

Also, I have noticed the same issue. The results on the unseen dataset are really bad.

Nomiluks avatar Nov 27 '23 10:11 Nomiluks

Could anyone tell me what hardware you are finetuning on? I tried on one A10G with a per-device batch size of 1, but I am getting an OOM error.

aneet-javis avatar Nov 27 '23 10:11 aneet-javis

After trying a couple of different machines, I used an A100 GCP instance and it worked like a charm.

Nomiluks avatar Nov 27 '23 11:11 Nomiluks

You can try lowering the number of epochs. Check out the example here: I finetuned for 3 epochs with batch size 8 on 100 GPT-4V-captioned anime examples, and it already works great: https://github.com/haotian-liu/LLaVA/issues/766#issuecomment-1800214174. You can also take a look at the wandb logs; the training loss should not be too low, as that indicates overfitting. Additionally, fusing a few samples from LLaVA-Instruct or the llava-v1.5 data mixture may also help reduce the overfitting.

@Nomiluks One of them is probably the end-of-epoch stats (there will be just one number for a single experiment), and the other may be the last-iteration stats (one number for each iteration, but only the last one is displayed); looking at the wandb interface may help you better understand the stats.
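
A minimal sketch of what "fusing a few samples" could look like (file names and the sample count here are placeholders, not a recommendation):

```python
import json
import random

# Placeholder paths: your task JSON plus the llava-v1.5 data mixture JSON.
custom = json.load(open("my_task_data.json"))
mixture = json.load(open("llava_v1_5_mix665k.json"))

# Blend in a small slice of the original mixture to help retain general ability.
random.seed(0)
blended = custom + random.sample(mixture, k=min(2000, len(mixture)))
random.shuffle(blended)

with open("my_task_data_blended.json", "w") as f:
    json.dump(blended, f, indent=2)
```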

haotian-liu avatar Nov 27 '23 16:11 haotian-liu

> You can try lowering the number of epochs. Check out the example here: I finetuned for 3 epochs with batch size 8 on 100 GPT-4V-captioned anime examples, and it already works great: #766 (comment). You can also take a look at the wandb logs; the training loss should not be too low, as that indicates overfitting. Additionally, fusing a few samples from LLaVA-Instruct or the llava-v1.5 data mixture may also help reduce the overfitting.
>
> @Nomiluks One of them is probably the end-of-epoch stats (there will be just one number for a single experiment), and the other may be the last-iteration stats (one number for each iteration, but only the last one is displayed); looking at the wandb interface may help you better understand the stats.

Thanks for your response @haotian-liu.

I'm working on implementing LLaVA to identify pixel-based image forgery or tampering in my dataset. I currently have 100 samples, and I'm considering LoRA-based fine-tuning as suggested in the documentation. Do you believe this sample size is sufficient for effective fine-tuning? Additionally, I'm open to any advice or best practices for training LLaVA to specifically detect image forgery. Your insights would be greatly appreciated!

Training Example:

    {
        "id": "tampered_654c8796140dc970e0d179d5-back",
        "image": "tampered_654c8796140dc970e0d179d5-back.jpeg",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nAnalyze the provided document image with the objective of detecting potential instances of image forgery resulting from digital tampering or manipulation. Identify all manipulated regions and present the results in the following format: [[x0, y0, x1, y1]]. If no tampered regions are identified, please return [[]]."
            },
            {
                "from": "gpt",
                "value": "[[0.578, 0.604, 0.938, 0.99]]"
            }
        ]
    }

Nomiluks avatar Nov 28 '23 07:11 Nomiluks

I wonder, if LLaVA faces a brand-new domain, should we do something like fine-tuning the vision encoder as a first step, since right now the vision encoder is not tuned?

FHL1998 avatar Nov 28 '23 08:11 FHL1998

@Nomiluks According to my experiments, a dataset of 100 samples can easily cause overfitting. I tried enlarging my dataset to 8000 entries (containing a few LLaVA instruction samples). However, the result shows declining performance; it cannot even interpret the "man behind the taxi" example. I am still figuring out the cause.

FHL1998 avatar Nov 28 '23 09:11 FHL1998

> @Nomiluks According to my experiments, a dataset of 100 samples can easily cause overfitting. I tried enlarging my dataset to 8000 entries (containing a few LLaVA instruction samples). However, the result shows declining performance; it cannot even interpret the "man behind the taxi" example. I am still figuring out the cause.

Yes, I am also having the same problem; have you found out the cause?

ronnymunthe99 avatar Nov 28 '23 09:11 ronnymunthe99

> I think the epoch number is too big. My model is also a little bit overfit after a full finetune on 20k data for 2 epochs (batch size 4), and the loss was 0.67.

Is the 0.67 the overall loss you're referring to? It seems a bit high; typically, we aim for a loss close to 0 for a well-fit model. This value might suggest that the model is underfitting. Could you provide more context or details about the training process? It's important to assess whether this level of loss is acceptable for your specific use case.

Nomiluks avatar Nov 28 '23 09:11 Nomiluks

> @Nomiluks According to my experiments, a dataset of 100 samples can easily cause overfitting. I tried enlarging my dataset to 8000 entries (containing a few LLaVA instruction samples). However, the result shows declining performance; it cannot even interpret the "man behind the taxi" example. I am still figuring out the cause.

Yeah, it seems it is unable to learn properly; the model either overfits or underfits.

Nomiluks avatar Nov 28 '23 09:11 Nomiluks

I am wondering how big the domain shift is. For example, for the extremely detailed anime captioning, I was actually surprised by what it can do with 100 examples: https://github.com/haotian-liu/LLaVA/issues/766#issuecomment-1800214174

haotian-liu avatar Nov 28 '23 16:11 haotian-liu

> I am wondering how big the domain shift is. For example, for the extremely detailed anime captioning, I was actually surprised by what it can do with 100 examples: #766 (comment)

@haotian-liu Here are two examples from my side, and the loss curve over 3 epochs:

[two example screenshots and a loss-curve screenshot]

FHL1998 avatar Nov 29 '23 00:11 FHL1998

The loss curve is very concerning here. Here is one of the LoRA finetuning loss curves on stable diffusion prompts.

[loss curve screenshot]

The initial spike suggests that there is something wrong.

haotian-liu avatar Nov 29 '23 00:11 haotian-liu

@Pro-xiaowen

Btw, just noticed this: [0.7677605, 0.815028, 0.8906875, 0.92288]. These coordinates seem overly accurate; you may just need three digits. The extra digits may just cause the model to hallucinate.
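
A quick sketch of trimming the existing annotations (the file name is a placeholder; the gpt values are the box strings shown above, which parse as JSON):

```python
import json

def round_boxes(value, ndigits=3):
    # Re-serialize a "[[x0, y0, x1, y1], ...]" answer string with fewer digits.
    boxes = json.loads(value)
    return json.dumps([[round(v, ndigits) for v in box] for box in boxes])

with open("llava_lora_finetune_train.json") as f:
    data = json.load(f)

for record in data:
    for turn in record["conversations"]:
        if turn["from"] == "gpt":
            turn["value"] = round_boxes(turn["value"])

with open("llava_lora_finetune_train_rounded.json", "w") as f:
    json.dump(data, f, indent=4)
```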

haotian-liu avatar Nov 29 '23 00:11 haotian-liu

> The loss curve is very concerning here. Here is one of the LoRA finetuning loss curves on stable diffusion prompts.
>
> [loss curve screenshot] The initial spike suggests that there is something wrong.

@haotian-liu Thanks for your reply! May I ask how many samples are included in the dataset? I mean the extra LLaVA instruction samples and the total number of samples.

FHL1998 avatar Nov 29 '23 02:11 FHL1998

> I think the epoch number is too big. My model is also a little bit overfit after a full finetune on 20k data for 2 epochs (batch size 4), and the loss was 0.67.
>
> Is the 0.67 the overall loss you're referring to? It seems a bit high; typically, we aim for a loss close to 0 for a well-fit model. This value might suggest that the model is underfitting. Could you provide more context or details about the training process? It's important to assess whether this level of loss is acceptable for your specific use case.

The LLM generates more words beyond your answer; that doesn't mean the answer is wrong. In my experiments, a loss between 0.6~0.8 is normal. If you want the model to be more accurate, you may want to focus on improving the size of the dataset. Here is my loss, and the model works well. ^_^ Hope it can help you. [loss curve screenshot]

Linziyang1999 avatar Nov 29 '23 03:11 Linziyang1999

> The LLM generates more words beyond your answer; that doesn't mean the answer is wrong. In my experiments, a loss between 0.6~0.8 is normal. If you want the model to be more accurate, you may want to focus on improving the size of the dataset. Here is my loss, and the model works well. ^_^ Hope it can help you. [loss curve screenshot]

@Linziyang1999 May I ask the number of samples included in your dataset (How many customized samples and original LLaVA samples)?

FHL1998 avatar Nov 29 '23 03:11 FHL1998

> The LLM generates more words beyond your answer; that doesn't mean the answer is wrong. In my experiments, a loss between 0.6~0.8 is normal. If you want the model to be more accurate, you may want to focus on improving the size of the dataset. Here is my loss, and the model works well. ^_^ Hope it can help you. [loss curve screenshot]
>
> @Linziyang1999 May I ask the number of samples included in your dataset (How many customized samples and original LLaVA samples)?

My custom samples are 20k, and I found that an error is raised during training if the dataset only has image conversations, so I added a few conversations without images from mix665k (10 maybe? just enough to make it work).
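
Roughly what I did, as a sketch (paths are placeholders; the point is to grab a few entries that have no "image" key):

```python
import json
import random

custom = json.load(open("my_custom_20k.json"))
mix665k = json.load(open("llava_v1_5_mix665k.json"))

# Entries without an "image" key are pure-text conversations.
text_only = [s for s in mix665k if "image" not in s]
random.seed(0)
custom += random.sample(text_only, k=10)

with open("my_custom_20k_plus_text.json", "w") as f:
    json.dump(custom, f, indent=2)
```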

Linziyang1999 avatar Nov 29 '23 03:11 Linziyang1999

@haotian-liu In my case, the loss seems to drop very quickly after only 30 steps. I have checked three things:

  1. I have already enlarged my dataset to 20k samples by mixing my customized dataset with LLaVA instruction samples;
  2. I have checked the dataset format (id, image, etc.); the sanity checks I run are sketched below the script;
  3. Everything went well during the finetuning phase (no error, warning, or size mismatch).

[loss curve screenshot]

Is there any obvious error that can be observed from my fine-tuning script, or does anyone have any idea about what happened? BTW, I used 4 A100 (80GB) GPUs.

deepspeed llava/train/train_mem.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path liuhaotian/llava-v1.5-13b \
    --version v1 \
    --data_path dataset_finetune/llava_finetune_task_v2.json \
    --image_folder ./playground/data/images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/llava-v1.5-13b-task-lora-v2 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 10 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 2 \
    --lazy_preprocess True
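
The format checks mentioned in point 2 are roughly the following (just a sanity-check sketch against the paths used in the script above):

```python
import json
import os

IMAGE_FOLDER = "./playground/data/images"                    # matches --image_folder
DATA_PATH = "dataset_finetune/llava_finetune_task_v2.json"   # matches --data_path

with open(DATA_PATH) as f:
    data = json.load(f)

for i, record in enumerate(data):
    assert "id" in record and "conversations" in record, f"missing keys at index {i}"
    convs = record["conversations"]
    assert len(convs) >= 2 and convs[0]["from"] == "human", f"bad turn order at index {i}"
    if "image" in record:
        # The image file must exist and the first human turn must carry the <image> token.
        assert os.path.exists(os.path.join(IMAGE_FOLDER, record["image"])), record["image"]
        assert "<image>" in convs[0]["value"], f"missing <image> token at index {i}"

print(f"checked {len(data)} records")
```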

FHL1998 avatar Nov 29 '23 05:11 FHL1998

> I think the epoch number is too big. My model is also a little bit overfit after a full finetune on 20k data for 2 epochs (batch size 4), and the loss was 0.67.

We found the finetuned LLaVA model is underfitting when setting the epochs to 1-10; even the predictions on the training data are wrong! @Linziyang1999

Pro-flynn avatar Nov 30 '23 08:11 Pro-flynn

> We found the finetuned LLaVA model is underfitting when setting the epochs to 1-10; even the predictions on the training data are wrong!

So will num_train_epochs 100 make your loss smaller and the predictions more accurate? (I also ran into trouble; my fine-tuning didn't work.) @Pro-xiaowen

CrazyBrick avatar Nov 30 '23 13:11 CrazyBrick

Hi guys, I am on Colab running on an A100 and trying to fine-tune using the code below, but I am facing errors related to ./checkpoints, train_mem.py, and the other train.py files.

My code

!git clone https://github.com/haotian-liu/LLaVA.git
%cd /content/LLaVA
!pip install -q gradio .

!bash /content/LLaVA/scripts/v1_5/finetune.sh


Can you guys help me with the correct way to fine-tune it?

rohitpanjwani03 avatar Dec 04 '23 09:12 rohitpanjwani03

I am getting gibberish output (even on the training data) with a weird loss curve when fully finetuning. Can someone please help me fix this?

I am trying to fully finetune the entire text-only model Vicuna-v1.5 using my custom QnA data comprising 160k QA pairs, with the same finetuning script as provided in finetune_task.sh but omitting the multimodal parameters. Here is the loss curve at 2.4 epochs. [wandb report / loss curve screenshot]

ninjacode01 avatar Feb 11 '24 08:02 ninjacode01