[Usage] Inference Speed Issue with LoRA Fine-tuned Model on ScienceQA
Hi Haotian,
Thank you for your incredible work on this project.
I am encountering an issue during inference. When I use the non-LoRA weights for inference on ScienceQA, it takes approximately 1 second per sample. However, when I switch to the LoRA fine-tuned model, the inference time increases drastically to over 40 seconds per sample.
Here is the command I am using for fine-tuning (trained on a single V100 with lora_r=4, bf16=False, tf32=False):
CUDA_VISIBLE_DEVICES=1 python3 llava/train/train.py \
--lora_enable True --lora_r 4 --lora_alpha 256 --mm_projector_lr 2e-5 \
--model_name_or_path ./LLAVA-1.5/llava-v1.5-7b/ \
--version v1 \
--data_path ./playground/data/eval/scienceqa/llava_train_CQM-A.json \
--image_folder ./data/ScienceQA/image/train/ \
--vision_tower ./data/clip-vit-large-patch14-336/ \
--mm_projector_type mlp2x_gelu \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--group_by_modality_length True \
--bf16 False \
--output_dir ./LLaVA-v1.5-7b-lora \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 1 \
--learning_rate 2e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 False \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb
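For reference, here is my understanding of how these LoRA flags map onto the adapter configuration (a minimal sketch using plain PEFT rather than the exact wiring inside llava/train/train.py; the values mirror the adapter_config.json pasted further down):
# Minimal sketch (plain PEFT, not LLaVA's exact train.py wiring) of the adapter
# config that the --lora_* flags above end up producing.
from peft import LoraConfig

lora_config = LoraConfig(
    r=4,                      # --lora_r 4
    lora_alpha=256,           # --lora_alpha 256 (note alpha / r = 64)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # all linear layers of the LLM backbone, as listed in adapter_config.json below
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
# train.py then wraps the language model with get_peft_model(model, lora_config).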
Here is the command I am using for inference:
CUDA_VISIBLE_DEVICES=3 python3 -m llava.eval.model_vqa_science \
--model-path ./LLaVA-v1.5-7b-lora/checkpoint-50000/ \
--model-base ./LLAVA-1.5/llava-v1.5-7b/ \
--question-file ./playground/data/eval/scienceqa/llava_test_CQM-A.json \
--image-folder ./data/ScienceQA/image/test/ \
--answers-file ./playground/data/eval/scienceqa/answers/llava-v1.5-7b-lora-50000.jsonl \
--single-pred-prompt \
--temperature 0 \
--conv-mode vicuna_v1
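For context, this is my understanding of what the eval script does with --model-path and --model-base for a LoRA checkpoint (a minimal sketch using llava.model.builder; the merged output directory at the end is just a hypothetical name):
# Minimal sketch of the LoRA loading path: the base weights come from
# --model-base, the adapter from --model-path, and (as far as I can tell from
# llava/model/builder.py) the adapter is merged into the base model before
# generation. Saving the merged model is optional and just avoids re-merging.
from llava.mm_utils import get_model_name_from_path
from llava.model.builder import load_pretrained_model

model_path = "./LLaVA-v1.5-7b-lora/checkpoint-50000/"   # --model-path
model_base = "./LLAVA-1.5/llava-v1.5-7b/"               # --model-base
model_name = get_model_name_from_path(model_path)       # contains "lora", so the merge branch is taken

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path, model_base, model_name
)

# Hypothetical output directory; later runs could point --model-path here
# and drop --model-base.
model.save_pretrained("./LLaVA-v1.5-7b-lora-merged")
tokenizer.save_pretrained("./LLaVA-v1.5-7b-lora-merged")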
Could you please help me understand why the inference speed differs so significantly between the two models?
Thank you!
Screenshots:
adapter_config.json:
{
"alpha_pattern": {},
"auto_mapping": null,
"base_model_name_or_path": "./data/LLAVA-1.5/llava-v1.5-7b/",
"bias": "none",
"fan_in_fan_out": false,
"inference_mode": true,
"init_lora_weights": true,
"layer_replication": null,
"layers_pattern": null,
"layers_to_transform": null,
"loftq_config": {},
"lora_alpha": 256,
"lora_dropout": 0.05,
"megatron_config": null,
"megatron_core": "megatron.core",
"modules_to_save": null,
"peft_type": "LORA",
"r": 4,
"rank_pattern": {},
"revision": null,
"target_modules": [
"down_proj",
"o_proj",
"q_proj",
"gate_proj",
"up_proj",
"v_proj",
"k_proj"
],
"task_type": "CAUSAL_LM",
"use_dora": false,
"use_rslora": false
config.json:
{
"_name_or_path": "./data/LLAVA-1.5/llava-v1.5-7b/",
"architectures": [
"LlavaLlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"freeze_mm_mlp_adapter": false,
"freeze_mm_vision_resampler": false,
"hidden_act": "silu",
"hidden_size": 4096,
"image_aspect_ratio": "pad",
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_length": 4096,
"max_position_embeddings": 4096,
"mm_hidden_size": 1024,
"mm_patch_merge_type": "flat",
"mm_projector_lr": 2e-05,
"mm_projector_type": "mlp2x_gelu",
"mm_resampler_type": null,
"mm_use_im_patch_token": false,
"mm_use_im_start_end": false,
"mm_vision_select_feature": "patch",
"mm_vision_select_layer": -2,
"mm_vision_tower": "./data/clip-vit-large-patch14-336/",
"model_type": "llava_llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 32,
"pad_token_id": 0,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 10000.0,
"tie_word_embeddings": false,
"tokenizer_model_max_length": 2048,
"tokenizer_padding_side": "right",
"torch_dtype": "float16",
"transformers_version": "4.37.2",
"tune_mm_mlp_adapter": false,
"tune_mm_vision_resampler": false,
"unfreeze_mm_vision_tower": false,
"use_cache": true,
"use_mm_proj": true,
"vocab_size": 32000
}
Hello, could you leave me a way to contact you? I am also trying to deploy LLaVA on a Tesla V100.
Same issue, extremely slow after adding LoRA weights.
Same issue, did you solve it? Thanks.
Same issue, extremely slow after adding LoRA weights, and the answers are empty. Did you solve it?
I had this same issue.
I found it might be because the merged model has ill-conditioned matrices (most likely because you did not change the default alpha value when you dropped the rank from 128 to 4). The training logs can tell you something: in my case, the loss increased dramatically after a certain point (around epoch 0.15).
Just a quick update: after I paired the rank with a matching alpha value, the issue went away. Don't forget to change the alpha value as well!
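For anyone else who lands here: PEFT scales the merged LoRA update by lora_alpha / r, so rank and alpha have to be changed together. A quick sanity check of the pair used in this thread versus the r=128 default the comment above refers to (a minimal sketch, assuming standard PEFT scaling):
# Sanity check of the scaling factor applied when the adapter is merged:
# delta_W = (lora_alpha / r) * B @ A
def lora_scale(lora_alpha: float, r: int) -> float:
    return lora_alpha / r

print(lora_scale(256, 128))  # 2.0  -> default recipe (r=128, alpha=256)
print(lora_scale(256, 4))    # 64.0 -> this thread (r=4, alpha=256): a 32x larger update scale,
                             #         which can leave the merged weights ill-conditioned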