
Influence of ViT

jiazhen-code opened this issue 4 months ago

Thank you for this insightful work. I have a question about the influence of the ViT. If you use a pre-trained ViT and freeze it, training only the added adapter layers while also keeping the LLaMA block frozen, does performance still consistently improve? (A sketch of the setup I mean is below.)
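For concreteness, here is a minimal PyTorch sketch of that configuration, not the repo's actual code. The class and adapter names are hypothetical, and it assumes `vit` returns patch tokens of shape `(B, N, vit_dim)` and that `llama_block` maps a `(B, N, llm_dim)` tensor to the same shape (a real HF decoder layer may additionally require attention masks or rotary position embeddings, depending on the transformers version):

```python
import torch
import torch.nn as nn

class FrozenViTWithLLMBlock(nn.Module):
    """Frozen pre-trained ViT + frozen LLaMA block; only adapters and the head train."""

    def __init__(self, vit, llama_block, vit_dim=768, llm_dim=4096, num_classes=1000):
        super().__init__()
        self.vit = vit                    # pre-trained ViT backbone, kept frozen
        self.llama_block = llama_block    # single LLaMA transformer block, kept frozen
        # Trainable linear adapters bridging the ViT and LLaMA widths
        self.adapter_in = nn.Linear(vit_dim, llm_dim)
        self.adapter_out = nn.Linear(llm_dim, vit_dim)
        self.head = nn.Linear(vit_dim, num_classes)
        # Freeze everything except the adapters and the head
        for p in self.vit.parameters():
            p.requires_grad = False
        for p in self.llama_block.parameters():
            p.requires_grad = False

    def forward(self, images):
        tokens = self.vit(images)          # assumed shape: (B, N, vit_dim)
        x = self.adapter_in(tokens)        # project up to the LLaMA width
        out = self.llama_block(x)          # frozen LLM block over visual tokens
        x = out[0] if isinstance(out, tuple) else out  # HF decoder layers return tuples
        x = self.adapter_out(x)            # project back down
        return self.head(x.mean(dim=1))    # mean-pool tokens, then classify
```

With this setup the optimizer would only ever see the adapter and head parameters, e.g. `torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)`.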

Additionally, would using a multimodal-aligned LLM, such as the LLaMA fine-tuned in LLaVA, achieve better performance than the original LLaMA? (A sketch of how one might pull comparable blocks from both models is below.) I find it fascinating to explore these aspects, as they could provide clearer guidance on how to utilize LLM blocks in vision components.
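If one wanted to run that comparison, extracting the same-index transformer block from both models could look like the following. The checkpoint names are only examples, and the exact attribute paths vary across transformers versions:

```python
from transformers import LlamaForCausalLM, LlavaForConditionalGeneration

# Vanilla LLaMA block (example checkpoint; swap in whatever base model you use)
llama = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
vanilla_block = llama.model.layers[-1]

# The multimodal-aligned counterpart from LLaVA's language model
llava = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
aligned_block = llava.language_model.model.layers[-1]
```

Each block could then be dropped into the frozen setup above, holding everything else fixed, to isolate the effect of multimodal alignment.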

jiazhen-code, Sep 27 '24