
[Usage] Cannot load the tuned projector weights

basteran opened this issue 1 year ago • 16 comments

Describe the issue

Issue: I want to fine-tune a multi-modal LLM on a downstream task that uses both images and text. This is what I've done:

  1. I used LLaMA 2 Chat as the LLM for LLaVA: I tuned the whole model using the script in finetune_lora.sh and successfully saved it.
  2. I loaded the tuned model using the eval.py script you provided, like this:
  nohup python -u ./llava/eval/run_llava.py \
    --model-path ./checkpoints/$MODEL_NAME \
    --model-base /models/$MODEL_BASE \
    --input-file-path ./dataset/test.xlsx \
    --image-path ./dataset/images

where:
  a. $MODEL_NAME is the folder where the result of step 1 was saved
  b. $MODEL_BASE is the local model path downloaded from LLaMA 2 Chat
  c. --input-file-path and --image-path are folders; the eval.py script has been modified to read all the texts and images in those folders
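For reference, the modified reading loop looks roughly like this (a minimal sketch, not the exact edited script; the .xlsx column names are hypothetical placeholders):

# Sketch of the folder-reading change (NOT the exact edited eval.py;
# the "text" and "image" column names are hypothetical placeholders).
import os
import pandas as pd

def load_samples(input_file_path, image_path):
    df = pd.read_excel(input_file_path)  # e.g. ./dataset/test.xlsx
    samples = []
    for _, row in df.iterrows():
        samples.append({
            "text": row["text"],                              # hypothetical column
            "image": os.path.join(image_path, row["image"]),  # hypothetical column
        })
    return samples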

But I think there is a problem with the projector: I cannot figure out how to save its weights because, as I understood from your paper, they are still tuned during my procedure. When I load the model for evaluation with the command above, I get a warning and disastrous outputs. Log:

Some weights of LlavaLlamaForCausalLM were not initialized from the model checkpoint at /models/Llama-2-13b-chat-hf and are newly initialized: ['model.mm_projector.bias', 'model.mm_projector.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Finally, I tried to explicitly tune the projector weights by passing the argument --tune_mm_mlp_adapter True, but the results are the same. Any thoughts, @haotian-liu?

Thank you in advance.

basteran avatar Oct 05 '23 10:10 basteran

I've met the same issue here; may I ask if you have solved it?

FHL1998 avatar Oct 13 '23 04:10 FHL1998

Hi @basteran @FHL1998

That message is expected: it appears when the loader first reads the --model-base (in your case llama-2-chat) as a llava-llama-2 model; the tuned weights are loaded afterwards.

Do you see any actual errors, or are the results just terrible?

Also, try the commands here, which use a LoRA we have trained: https://github.com/haotian-liu/LLaVA/blob/main/docs/LoRA.md#launch-a-model-worker
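For anyone hitting this, a quick sanity check is to confirm that the tuned projector weights actually exist in the LoRA output folder. A minimal sketch, assuming the run wrote a non_lora_trainables.bin next to the adapter weights (the path is a placeholder):

# Sketch: list the projector tensors saved by the LoRA run.
import torch

ckpt = "./checkpoints/<your-lora-output-dir>/non_lora_trainables.bin"  # placeholder path
state = torch.load(ckpt, map_location="cpu")

proj_keys = [k for k in state if "mm_projector" in k]
print(f"{len(proj_keys)} projector tensors found")
for k in proj_keys:
    print(k, tuple(state[k].shape))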

haotian-liu avatar Oct 13 '23 04:10 haotian-liu

> Hi @basteran @FHL1998
>
> That message is expected: it appears when the loader first reads the --model-base (in your case llama-2-chat) as a llava-llama-2 model; the tuned weights are loaded afterwards.
>
> Do you see any actual errors, or are the results just terrible?
>
> Also, try the commands here, which use a LoRA we have trained: https://github.com/haotian-liu/LLaVA/blob/main/docs/LoRA.md#launch-a-model-worker

Thanks for your kind reply. I'm facing another issue related to this. After performing instruction fine-tuning without LoRA (using finetune.sh directly), I used the script below for the Gradio demo:

python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path ./checkpoints/llava-v1.5-13b-fine-tune

This script raises another, similar warning:

Some weights of the model checkpoint at ./checkpoints/llava-v1.5-13b-fine-tune were not used when initializing LlavaLlamaForCausalLM: ['model.vision_tower.vision_tower.vision_model.encoder.layers.2.self_attn.out_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.10.self_attn.k_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.17.layer_norm2.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.2.layer_norm2.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.4.mlp.fc1.bias', ...]

I wonder if this is normal, because I find the responses are not affected by the warning message. Does this mean I need to do anything further with the checkpoints after fine-tuning? This is similar to issue #382. Thanks in advance if you can guide me through this, @haotian-liu.

I attached the config.json from ./checkpoints/llava-v1.5-13b-fine-tune as well; I did not see an obvious difference:

{
  "_name_or_path": "lmsys/vicuna-7b-v1.5",
  "architectures": [ "LlavaLlamaForCausalLM" ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "freeze_mm_mlp_adapter": false,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "image_aspect_ratio": "pad",
  "image_grid_pinpoints": null,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "mm_hidden_size": 1024,
  "mm_projector_type": "mlp2x_gelu",
  "mm_use_im_patch_token": false,
  "mm_use_im_start_end": false,
  "mm_vision_select_feature": "patch",
  "mm_vision_select_layer": -2,
  "mm_vision_tower": "openai/clip-vit-large-patch14-336",
  "model_type": "llava",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.31.0",
  "tune_mm_mlp_adapter": false,
  "use_cache": true,
  "use_mm_proj": true,
  "vocab_size": 32000
}

FHL1998 avatar Oct 13 '23 05:10 FHL1998

@FHL1998 You do not need to do anything else. This is normal: DeepSpeed saves all parameters, including the vision tower. If you want to remove the vision tower, as in the checkpoints we released, you can do this:

python -m llava.model.consolidate --src model --dst model_consolidate

The model predictions will be the same whether or not you do this.
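For reference, the effect of that consolidation can be illustrated roughly as follows. This is a simplified sketch of the idea only (filtering the duplicated vision-tower tensors out of the saved state dict), not the actual consolidate script, and it assumes a single, unsharded pytorch_model.bin:

# Simplified illustration: drop the duplicated vision-tower tensors from a
# full (non-LoRA) checkpoint. Prefer the real llava.model.consolidate script;
# paths are placeholders and sharded checkpoints are not handled here.
import torch

src = "./checkpoints/llava-v1.5-13b-fine-tune/pytorch_model.bin"  # placeholder
state = torch.load(src, map_location="cpu")

slim = {k: v for k, v in state.items() if "vision_tower" not in k}
print(f"kept {len(slim)} of {len(state)} tensors")
torch.save(slim, src.replace(".bin", "_no_vision_tower.bin"))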

haotian-liu avatar Oct 13 '23 05:10 haotian-liu

> Hi @basteran @FHL1998
>
> That message is expected: it appears when the loader first reads the --model-base (in your case llama-2-chat) as a llava-llama-2 model; the tuned weights are loaded afterwards.
>
> Do you see any actual errors, or are the results just terrible?

Thank you for getting back to me. I see no error messages; the results are just terrible, as if the model had been tuned only on the text and disregards the images.

> Also, try the commands here, which use a LoRA we have trained: https://github.com/haotian-liu/LLaVA/blob/main/docs/LoRA.md#launch-a-model-worker

Yes, I have tried those commands, and with your model the results seem reasonable; when I load my model, the results are just as bad as with the eval.py script. What do you think? @haotian-liu

basteran avatar Oct 16 '23 11:10 basteran

@basteran

> the results are just terrible, as if the model had been tuned only on the text and disregards the images.

What is your model_path? Does it contain both "llava" and "llama_2", or something similar?

Also, if you can share the whole command you used for tuning the LoRA, it may help me understand better.
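As a reference point, the loader decides how to treat a checkpoint from its folder name, so a minimal loading call looks roughly like this (a sketch using the paths from this thread; the exact arguments and return values may differ between versions):

# Minimal loading sketch (placeholder paths). get_model_name_from_path derives
# the name the builder uses to pick the loading branch (LLaVA vs. plain LM,
# LoRA vs. full checkpoint), which is why the folder name matters.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "./checkpoints/llava-llama2chat13b-tune_projector-finetune_lora"
model_base = "/models/Llama-2-13b-chat-hf"

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path, model_base, get_model_name_from_path(model_path)
)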

haotian-liu avatar Oct 16 '23 15:10 haotian-liu

> https://github.com/haotian-liu/LLaVA/blob/main/docs/LoRA.md#launch-a-model-worker

Hi Haotian, I have a similar question.

  1. Is the mm_projector tuned during LLaVA's fine-tuning stage when using LoRA? (I guess the answer should be yes.)
  2. If it is tuned, will the tuned projector be saved and loaded into the model during inference?

tingxueronghua avatar Oct 17 '23 02:10 tingxueronghua

> https://github.com/haotian-liu/LLaVA/blob/main/docs/LoRA.md#launch-a-model-worker
>
> Hi Haotian, I have a similar question.
>
>   1. Is the mm_projector tuned during LLaVA's fine-tuning stage when using LoRA? (I guess the answer should be yes.)
>   2. If it is tuned, will the tuned projector be saved and loaded into the model during inference?

Oh, I think I get the point: the "non_lora_trainables.bin" file solves this problem. Sorry for the interruption.
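For completeness, the projector weights in that file can also be loaded by hand into an already-built LlavaLlamaForCausalLM. A sketch that mirrors, as far as I can tell, the prefix handling in the released loader (the prefixes come from the PEFT wrapping and are assumptions that may differ in other setups):

# Sketch: load the tuned mm_projector (and other non-LoRA tensors) from
# non_lora_trainables.bin into an already-built model.
import torch

def load_non_lora_trainables(model, path):
    weights = torch.load(path, map_location="cpu")

    def strip(k):
        # Assumed prefixes from the PEFT-wrapped model's state dict.
        if k.startswith("base_model."):
            k = k[len("base_model."):]
        if k.startswith("model.model."):
            k = k[len("model."):]
        return k

    weights = {strip(k): v for k, v in weights.items()}
    # strict=False: the file only holds the few non-LoRA tensors.
    model.load_state_dict(weights, strict=False)
    return [k for k in weights if "mm_projector" in k]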

tingxueronghua avatar Oct 17 '23 03:10 tingxueronghua

> @basteran
>
> the results are just terrible, as if the model had been tuned only on the text and disregards the images.
>
> What is your model_path? Does it contain both "llava" and "llama_2", or something similar?

This is the model_path = ./checkpoints/llava-llama2chat13b-tune_projector-finetune_lora

> Also, if you can share the whole command you used for tuning the LoRA, it may help me understand better.

deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --lora_enable True \
    --model_name_or_path ./models/llama2chat13b \
    --version llava_llama_2 \
    --data_path ./dataset/train.json \
    --image_folder ./dataset/images \
    --vision_tower openai/clip-vit-large-patch14 \
    --pretrain_mm_mlp_adapter ./models/llava-pretrain-llama-2-13b-chat/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir ./checkpoints/llava-llama2chat13b-tune_projector-finetune_lora \
    --num_train_epochs 10 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.1 \
    --lr_scheduler_type "linear" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to none

basteran avatar Oct 17 '23 08:10 basteran

Hi, these all look good to me. It is strange that such an error is printed. We have just released the LoRA checkpoint/script for LLaVA-1.5: https://github.com/haotian-liu/LLaVA/releases/tag/v1.1.3

Maybe you can try with the new code/sample script?

haotian-liu avatar Oct 26 '23 20:10 haotian-liu

Thank you @haotian-liu, I will try it as soon as possible and let you know ;)

basteran avatar Oct 27 '23 12:10 basteran

@basteran @FHL1998 I've met the same issue here; may I ask if you have solved it?

amanysalahfattouh avatar Apr 02 '24 12:04 amanysalahfattouh

Hi @amanysalahfattouh, I didn't solve it. I just didn't load the projector weights... and it's disappointing that after so many months, no one has solved it or provided a solution!

basteran avatar Apr 02 '24 12:04 basteran

@basteran @amanysalahfattouh @haotian-liu, have you tried using the flag --pretrain_mm_mlp_adapter with the path set to the non_lora_trainables.bin of your fine-tuned model? I have been facing the same issue as discussed here, in a different setting, i.e., fine-tuning an instruction-tuned model trained on a custom dataset. I found this fix to work for my situation, with a few more modifications. Specifically, I had to change this line:

if hasattr(config, "mm_vision_tower"):

to

if False and hasattr(config, "mm_vision_tower"):

to prevent self.mm_projector from being initialized with the rest of the model. I faced size mismatch errors without the above modification when running this line:

self.mm_projector.load_state_dict(get_w(mm_projector_weights, 'mm_projector'))

The size mismatch might be related to an issue with loading weights when zero3 is enabled.
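An alternative, untested sketch: instead of pointing --pretrain_mm_mlp_adapter at non_lora_trainables.bin and patching the loader, one could first extract just the projector tensors into their own mm_projector.bin, the file name the pretraining stage normally produces (cf. the --pretrain_mm_mlp_adapter path in the training command above):

# Sketch (untested): pull only the mm_projector tensors out of
# non_lora_trainables.bin and save them as a standalone mm_projector.bin.
# Paths are placeholders.
import torch

src = "./checkpoints/<your-lora-output-dir>/non_lora_trainables.bin"
state = torch.load(src, map_location="cpu")

proj = {k: v for k, v in state.items() if "mm_projector" in k}
assert proj, "no mm_projector tensors found in %s" % src
torch.save(proj, src.replace("non_lora_trainables.bin", "mm_projector.bin"))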

adymaharana avatar Apr 17 '24 06:04 adymaharana

> @basteran @amanysalahfattouh @haotian-liu, have you tried using the flag --pretrain_mm_mlp_adapter with the path set to the non_lora_trainables.bin of your fine-tuned model? I have been facing the same issue as discussed here, in a different setting, i.e., fine-tuning an instruction-tuned model trained on a custom dataset. I found this fix to work for my situation, with a few more modifications. Specifically, I had to change this line:
>
> if hasattr(config, "mm_vision_tower"):
>
> to
>
> if False and hasattr(config, "mm_vision_tower"):
>
> to prevent self.mm_projector from being initialized with the rest of the model. I faced size mismatch errors without the above modification when running this line:
>
> self.mm_projector.load_state_dict(get_w(mm_projector_weights, 'mm_projector'))
>
> The size mismatch might be related to an issue with loading weights when zero3 is enabled.

Thank you, but my problem is that after fine-tuning I see no "mm_projector" file in the checkpoints, so I cannot load it for inference later.

basteran avatar Apr 17 '24 10:04 basteran

> @basteran @amanysalahfattouh @haotian-liu, have you tried using the flag --pretrain_mm_mlp_adapter with the path set to the non_lora_trainables.bin of your fine-tuned model? I have been facing the same issue as discussed here, in a different setting, i.e., fine-tuning an instruction-tuned model trained on a custom dataset. I found this fix to work for my situation, with a few more modifications. Specifically, I had to change this line:
>
> if hasattr(config, "mm_vision_tower"):
>
> to
>
> if False and hasattr(config, "mm_vision_tower"):
>
> to prevent self.mm_projector from being initialized with the rest of the model. I faced size mismatch errors without the above modification when running this line:
>
> self.mm_projector.load_state_dict(get_w(mm_projector_weights, 'mm_projector'))
>
> The size mismatch might be related to an issue with loading weights when zero3 is enabled.

I solved it by using zero2 instead. Thank you very much! I am happy to help anyone who encounters this issue.

dongwhfdyer avatar Apr 25 '24 06:04 dongwhfdyer

@basteran Were you able to find a solution for the mm_projector? I am also facing the same issue.

When I load the model for evaluation, it throws the error below: FileNotFoundError: [Errno 2] No such file or directory: 'train_cot_llava_llama2/mm_projector.bin'

amandalmia14 avatar Jun 12 '24 20:06 amandalmia14