
[Question] Script question when pretraining with a model_base of LLaVA-1.5

TesiLin opened this issue on Nov 07 '23 · 7 comments

Question

Hi, thank you for your excellent work.

I've been trying to run the pre-training stage on top of LLaVA-1.5 to take advantage of this new version's improved performance. Could you please check whether my understanding of the settings is correct?

From my understanding, during pre-training I should:

  1. Set the "model_name_or_path" parameter to "liuhaotian/llava-v1.5-13b" to initialize the model weights of LLaVA-1.5.
  2. Set the "version" parameter to "v1".

My understanding is that this will read the mm_mlp_adapter and LLM weights from liuhaotian/llava-v1.5-13b, freeze the LLM, and train only the adapter. Or should I add the "pretrain_mm_mlp_adapter" parameter and download the checkpoint from somewhere? I could not find it on its own in the model zoo.

During the fine-tuning stage, I should:

  1. Set the "model_name_or_path" parameter to pretrain stage checkpoints.
  2. Keep the "version" parameter as "v1".

By the way, how should I decide the value of the "version" parameter? Is there typically a relationship between it and the actual model version number?

Thanks for your help!

TesiLin · Nov 07 '23

Hi! I'm not the author but I may be able to help with some of the questions here.

Set the "model_name_or_path" parameter to "liuhaotian/llava-v1.5-13b" to initialize the model weights of LLaVA-1.5.

This will load a pre-trained and instruction-tuned LLaVA model. If you are pre-training from scratch, you do not need to load this. Instead, set model_name_or_path to a base LLM (like lmsys/vicuna-7b-v1.5 or the MPT base model). Set the vision tower to a CLIP model (this repo suggests openai/clip-vit-large-patch14-336); doing so automatically initializes the mm_adapter. To train it, set the relevant flag to True.
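For reference, here is a minimal sketch of what such a pre-training launch could look like, pieced together from the flags discussed here and the repo's pretrain script. The data paths and output directory are placeholders, most hyperparameters are omitted, and you should double-check the --version value against the repo's own pretrain script:

```bash
# Sketch only: placeholder paths, most hyperparameters omitted.
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path lmsys/vicuna-7b-v1.5 \
    --version v1 \
    --data_path /path/to/pretrain_data.json \
    --image_folder /path/to/images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --tune_mm_mlp_adapter True \
    --bf16 True \
    --output_dir ./checkpoints/llava-pretrain
```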

Set the "version" parameter to "v1".

To the best of my knowledge, the version selects the conversation template used to format prompts, so it should be in line with your base LLM. Check out llava/conversation.py to understand this better.
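If it helps, a quick way to inspect the available templates (assuming the conv_templates dict in llava/conversation.py; the names in the comments are just examples):

```python
# List the registered conversation templates; --version selects one of these keys.
from llava.conversation import conv_templates

print(sorted(conv_templates.keys()))  # e.g. "plain", "v1", "llava_v1", "mpt", ...
conv = conv_templates["v1"].copy()    # the format picked by --version v1
print(conv.system)                    # the system prompt that this template uses
```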

Set the "model_name_or_path" parameter to pretrain stage checkpoints

No, use the same model_name_or_path, since pre-training does not affect the LLM parameters. Also add the flags to tune the adapter and enable LoRA. To load the pre-trained mm_adapter, use the pretrain_mm_mlp_adapter flag.
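Roughly, and again only as a sketch (placeholder paths; LoRA rank/alpha and the other hyperparameters from the repo's finetune scripts are omitted), the fine-tuning launch could look like:

```bash
# Sketch only: same base LLM as stage 1, stage-1 projector loaded, LoRA enabled.
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --lora_enable True \
    --model_name_or_path lmsys/vicuna-7b-v1.5 \
    --version v1 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-pretrain/mm_projector.bin \
    --mm_projector_lr 2e-5 \
    --data_path /path/to/finetune_data.json \
    --image_folder /path/to/images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --bf16 True \
    --output_dir ./checkpoints/llava-finetune-lora
```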

Keep the "version" parameter to "v1". Is there typically a relationship between it and the actual model version number?

No relation that I am aware of. But the version should stay the same between pre-training and instruction tuning.

Please feel free to correct me if anything I mentioned seems wrong!

devaansh100 · Nov 07 '23

Thanks @devaansh100, that helps a lot. I have also checked the conversation templates, which makes things clearer; thank you for the suggestion.

I have two more things I'd like to confirm.

I want to perform two-stage training on sub-domain tasks, based on the fine-tuned weights of LLaVA-1.5, just like what LLaVA-Med did. Should I directly load 'liuhaotian/llava-v1.5-13b' for this?

Then, during fine-tuning, if I set pretrain_mm_mlp_adapter, will the new projector weights override those of LLaVA-v1.5? Thanks again.

TesiLin · Nov 07 '23

I'm not completely sure how that would work. Ideally, to load the entire HuggingFace model, you would use load_pretrained_model from llava/model/builder.py. However, that is not used during training.

One hack that I can think of is to use whatever commands I mentioned in the previous message. Then you have a model in the original format (let's call this model A).

After that, add a function to reinitialize model A with the weights of liuhaotian/llava-v1.5-13b (loaded with the aforementioned function), then remove all unnecessary/redundant weights from memory. Expect this to be a bit slow, though.

It's also worth trying to do this without creating model A (you will need to check if this affects other settings in train.py).
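For what it's worth, a rough sketch of that reinitialization idea, assuming load_pretrained_model from llava/model/builder.py and a hypothetical helper that receives the training-format model built by train.py (illustrative only, not code from the repo):

```python
import torch
from llava.model.builder import load_pretrained_model


def reinit_from_llava_v15(model_a):
    """Overwrite model A's weights with the released LLaVA-1.5 checkpoint.

    `model_a` is assumed to be the model built by train.py before training
    starts; this helper is a sketch of the idea above, not code from the repo.
    """
    # Load the released checkpoint in its inference format.
    tokenizer, ref_model, image_processor, context_len = load_pretrained_model(
        model_path="liuhaotian/llava-v1.5-13b",
        model_base=None,
        model_name="llava-v1.5-13b",
    )

    # Copy the weights over; strict=False tolerates key mismatches, but any
    # missing/unexpected keys reported here should be inspected manually.
    missing, unexpected = model_a.load_state_dict(ref_model.state_dict(), strict=False)
    print("missing:", missing, "unexpected:", unexpected)

    # Free the redundant reference copy (expect the whole step to be slow).
    del ref_model
    torch.cuda.empty_cache()
    return model_a
```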

For fine-tuning, yes, just set that flag and load the relevant files from output_dir.

devaansh100 · Nov 07 '23

I see. I will give it a try. Thanks again! Your reply really helped me.

TesiLin · Nov 07 '23

Hi

I want to perform two-stage training on sub-domain tasks, based on the fine-tuned weights of LLaVA-1.5, just like what LLaVA-Med did. Should I directly load 'liuhaotian/llava-v1.5-13b' for this?

I am thinking the same thing. If you don't mind, could you share the script? Thank you.

unmo · Nov 21 '23

@unmo Hi! Did you solve it?

459737087 · Jan 11 '24

No, I haven't solved it yet.

unmo · Jan 12 '24