[Question] Script question when pretrain with a model_base of LLaVA-1.5
Question
Hi, thank you for your excellent work.
I've been trying to run the pre-training process based on LLaVA-1.5 to take advantage of this new version's strong performance. Could you please check whether my understanding of the settings is correct?
From my understanding, during pre-training I should:
- Set the "model_name_or_path" parameter to "liuhaotian/llava-v1.5-13b" to initialize the model weights of LLaVA-1.5.
- Set the "version" parameter to "v1".
The above should make it read the mm_mlp_adapter and LLM weights from liuhaotian/llava-v1.5-13b, freeze the LLM, and start training the adapter. Or should I add the "pretrain_mm_mlp_adapter" parameter and download that checkpoint from somewhere? I did not find it on its own in the model zoo.
During the fine-tuning stage, I should:
- Set the "model_name_or_path" parameter to pretrain stage checkpoints.
- Keep the "version" parameter to "v1".
Btw, how should I decide the value of parameter "version"? Is there typically a relationship between it and the actual model version number?
Thanks for your help!
Hi! I'm not the author but I may be able to help with some of the questions here.
Set the "model_name_or_path" parameter to "liuhaotian/llava-v1.5-13b" to initialize the model weights of LLaVA-1.5.
This will load a pre-trained and instruction-tuned LLaVA model. If you are pre-training from scratch, you do not need to load this. Instead, set model_name_or_path to some base LLM (like lmsys/vicuna-7b-v1.5 or the MPT base model). Set the vision tower to a CLIP model (this repo suggests openai/clip-vit-large-patch14-336) - doing so automatically initializes the mm_adapter. To train it, you can set the relevant flag to True.
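To make that concrete, here is a minimal sketch of what such a pre-training launch could look like. This is not the repo's own script: the flag names follow the public training scripts as far as I remember, and the data paths, output directory, and launcher are placeholders you would adapt.

```python
# Sketch of a stage-1 (adapter pre-training) launch based on the flags discussed above.
# Flag names mirror the public training scripts to the best of my knowledge; paths are placeholders.
import subprocess

pretrain_cmd = [
    "deepspeed", "llava/train/train_mem.py",
    "--model_name_or_path", "lmsys/vicuna-7b-v1.5",         # a base LLM, not the finished LLaVA checkpoint
    "--vision_tower", "openai/clip-vit-large-patch14-336",  # CLIP tower; the mm adapter is initialized automatically
    "--tune_mm_mlp_adapter", "True",                        # train only the adapter in this stage
    "--data_path", "/path/to/pretrain_data.json",           # placeholder
    "--image_folder", "/path/to/images",                    # placeholder
    "--output_dir", "./checkpoints/llava-pretrain",         # placeholder
]
subprocess.run(pretrain_cmd, check=True)
```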
Set the "version" parameter to "v1".
To the best of my knowledge, the version affects the conversation format that would be used during inference. This should be in line with your base LLM. Check out llava/conversation.py to understand this better.
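If it helps, here is a quick way to see what the different versions map to. I am assuming the conv_templates dict exposed by llava/conversation.py, which is how the public repo is laid out as far as I know.

```python
# Inspect the conversation templates to see what "version" actually selects.
from llava.conversation import conv_templates

print(sorted(conv_templates.keys()))       # available "version" names, e.g. "plain", "v1", "llava_v1", ...

conv = conv_templates["v1"].copy()         # the template selected by --version v1
conv.append_message(conv.roles[0], "Describe the image.")
conv.append_message(conv.roles[1], None)
print(conv.get_prompt())                   # the exact prompt format used during training/inference
```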
Set the "model_name_or_path" parameter to pretrain stage checkpoints
No, use the same model_name_or_path, since pre-training does not affect the LLM parameters. Also add the flags to tune the adapter and enable LoRA. To load the pre-trained mm_adapter, use the pretrain_mm_mlp_adapter flag.
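For reference, a similarly hedged sketch of the fine-tuning launch. Flag names are taken from the public fine-tuning scripts as I understand them, and all paths (including the mm_projector.bin produced by the pre-training stage) are placeholders.

```python
# Sketch of a stage-2 (instruction tuning) launch based on the advice above.
# Flag names mirror the public fine-tuning scripts to the best of my knowledge; paths are placeholders.
import subprocess

finetune_cmd = [
    "deepspeed", "llava/train/train_mem.py",
    "--model_name_or_path", "lmsys/vicuna-7b-v1.5",          # same base LLM as in pre-training
    "--version", "v1",                                       # conversation template for instruction tuning
    "--vision_tower", "openai/clip-vit-large-patch14-336",
    "--pretrain_mm_mlp_adapter", "./checkpoints/llava-pretrain/mm_projector.bin",  # adapter from stage 1
    "--lora_enable", "True",                                 # optional: LoRA instead of full fine-tuning
    "--data_path", "/path/to/finetune_data.json",            # placeholder
    "--image_folder", "/path/to/images",                     # placeholder
    "--output_dir", "./checkpoints/llava-finetune",          # placeholder
]
subprocess.run(finetune_cmd, check=True)
```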
Keep the "version" parameter to "v1". Is there typically a relationship between it and the actual model version number?
No relation that I am aware of. But the version should be the same during instruction tuning.
Please feel free to correct me if anything I mentioned seems wrong!
Thanks @devaansh100, it helps a lot. I have also checked the conversation templates, which makes things much clearer; thank you for the suggestion.
I have two more things I'd like to confirm.
I want to perform two-stage training on sub-domain tasks, based on the fine-tuned weights of LLaVA-1.5, just like what LLaVA-Med did. Should I directly load 'liuhaotian/llava-v1.5-13b' for this?
Then, during fine-tuning, if I set pretrain_mm_mlp_adapter, will the new projector override the one from LLaVA-v1.5?
Thanks again.
I'm not completely sure how that would work. Ideally, to load the entire HuggingFace model, you would use load_pretrained_model from llava/model/builder.py. However, that is not used during training.
One hack that I can think of is to use whatever commands I mentioned in the previous message. Then you have a model in the original format (let's call this model A). After that, add a function to reinitialize model A with the weights of liuhaotian/llava-v1.5-13b (loaded with the aforementioned function), then remove all unnecessary/redundant weights from memory. Expect this to be a bit slow, though. It's also worth trying to do this without creating model A (you will need to check whether this affects other settings in train.py).
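To illustrate the hack (untested, so treat it as a sketch rather than a recipe): model_a below is a placeholder for the model that train.py builds from the base LLM and CLIP tower, and I am assuming the load_pretrained_model signature from the public builder.py.

```python
# Rough sketch of the "model A" reinitialization described above.
# model_a is a placeholder for the model constructed inside train.py; this is not a drop-in script.
import torch
from llava.mm_utils import get_model_name_from_path
from llava.model.builder import load_pretrained_model

model_path = "liuhaotian/llava-v1.5-13b"
tokenizer, llava_v15, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)

# Copy the LLaVA-1.5 weights into model A before training continues.
missing, unexpected = model_a.load_state_dict(llava_v15.state_dict(), strict=False)

# Drop the donor model to free memory; expect the whole round-trip to be slow.
del llava_v15
torch.cuda.empty_cache()
```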
For fine-tuning, yes, just set that flag and load the relevant files from output_dir.
I see. I will give it a try. Thanks again! Your reply truly helped me.
Hi
I want to perform two-stage training on sub-domain tasks, based on the fine-tuned weights of LLaVA-1.5, just like what LLaVA-Med did. Should I directly load 'liuhaotian/llava-v1.5-13b' for this?
I am thinking the same thing. If you don't mind, could you share the script? Thank you.
@unmo Hi! Did you solve it?
No, I haven't solved it yet.