
How to fine-tune the LLaVA-7b model?

Open yunh-w opened this issue 1 year ago • 4 comments

Question

Hi, thanks for your great work!

I use the following command to fine-tune the LLaVA-7b model:

$PYTHON --nnodes=1 --nproc_per_node=8 --master_port=25001 \
    llava/train/train_mem.py \
    --model_name_or_path LLaMA-7b-convert \
    --data_path $data_path \
    --image_folder $image_folder \
    --vision_tower $vision_tower \
    --pretrain_mm_mlp_adapter LLaVA-7b-pretrain-projector-v0-CC3M-595K-original_caption.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --bf16 True \
    --output_dir ./checkpoints/llava-7B_new \
    --num_train_epochs 5 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb

But I get three weight files, while your released LLaVA-7b weights consist of two. And I get an error when loading these fine-tuned weights. How should I fine-tune LLaVA-7b? Thanks so much!


OSError: Unable to load weights from pytorch checkpoint file for 'LLaVA-main/checkpoints/llava-7B_new/checkpoint-5/pytorch_model-00003-of-00003.bin' at 'LLaVA-main/checkpoints/llava-7B_new/checkpoint-5/pytorch_model-00003-of-00003.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

I found that the third shard was not saved completely: saving hit an out-of-memory error, but the training did not stop. Thanks.
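For reference, a small sanity check like the sketch below (my own, not a script from the repo; the checkpoint path is taken from the error above) can confirm which shard is truncated, since a shard cut short by an OOM will fail to deserialize:

```python
# Verify that every shard listed in the checkpoint index exists and can be
# loaded. A shard truncated by an OOM during saving will raise an error here.
import json
import os
import torch

ckpt = "LLaVA-main/checkpoints/llava-7B_new/checkpoint-5"  # path from the error above
with open(os.path.join(ckpt, "pytorch_model.bin.index.json")) as f:
    index = json.load(f)

for shard in sorted(set(index["weight_map"].values())):
    try:
        torch.load(os.path.join(ckpt, shard), map_location="cpu")
        print("OK ", shard)
    except Exception as e:
        print("BAD", shard, "->", e)
```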

yunh-w, May 10 '23

Me too! After I fine-tune the 7B model, I get three .bin files, but the release has two. The files I get from fine-tuning are also very large: the total_size in "pytorch_model.bin.index.json" is 26970595328, while the released one is only 13485301760.


Chen-Song, May 10 '23

Hi @Chen-Song, you may notice that the size of your trained model is roughly 2x the size of the released checkpoints. This is because transformers saves the model weights in float32. When I release the weights, I convert them to float16 to save storage space and bandwidth.
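The numbers reported in this thread line up with that, as a quick check shows:

```python
# Quick arithmetic on the sizes reported above (raw weight bytes only):
fp32_bytes = 26970595328   # total_size of the fine-tuned fp32 checkpoint
fp16_bytes = 13485301760   # total_size of the released fp16 weights
print(fp32_bytes / fp16_bytes)   # ~2.0 -> 4 bytes vs 2 bytes per parameter
print(fp32_bytes / 4)            # ~6.74e9 parameters, i.e. the 7B model
```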

@yunh-w Can you share the size of your trained model weights with ls -lt, like @Chen-Song did? Thanks.

haotian-liu, May 10 '23

@haotian-liu What is the process to convert float32 to float16? I have a 13B fine-tuned model that is 50G.

codybum, May 16 '23

@codybum You can use this script to compress the model. Please make sure to set two different paths (do not overwrite the fp32 model), and only delete the fp32 source model after verifying that the converted model works properly. Thanks.
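The core of the conversion is just loading the weights and casting them to half precision before re-saving. A minimal sketch of that idea (not the linked script itself; AutoModelForCausalLM stands in for the project's own model class, and the paths are placeholders):

```python
# Sketch of fp32 -> fp16 conversion. Keep source and destination separate so
# the fp32 original survives until the converted model is verified.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "./checkpoints/llava-13B_new"        # fp32 training output (placeholder)
dst = "./checkpoints/llava-13B_new-fp16"   # separate output directory

model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.float32)
model = model.half()                       # cast all weights to float16
model.save_pretrained(dst)

AutoTokenizer.from_pretrained(src).save_pretrained(dst)
```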

haotian-liu, May 16 '23

How can we fine-tune it on custom data, and what format should the dataset be in?

anonymous-atom, Oct 13 '23

@anonymous-atom Here is an example dataset: https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/detail_23k.json

You just need to take your data and convert it to this format (roughly the shape sketched below). You can then use the training scripts, substituting your dataset as the training set.
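Roughly, each record in that file pairs an image with a list of conversation turns. A small sketch of writing data in that shape (field values and the output filename are placeholders):

```python
# Instruction-tuning record layout used by files like detail_23k.json: a list
# of entries, each with an "id", an "image" filename relative to --image_folder,
# and alternating "human"/"gpt" turns; "<image>" marks where the image goes.
import json

records = [
    {
        "id": "000000001",
        "image": "example.jpg",
        "conversations": [
            {"from": "human", "value": "<image>\nDescribe the image in detail."},
            {"from": "gpt", "value": "A detailed placeholder description of the image."},
        ],
    }
]

with open("my_finetune_data.json", "w") as f:
    json.dump(records, f, indent=2)
```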

codybum, Oct 13 '23

@yunh-w Hi, what hardware did you use?

ali7919, Jan 04 '24