
[Question] How can this library be migrated to and used with Hugging Face's visual language models?

Open · jiabao-wang opened this issue 1 year ago · 1 comment

How can this library be migrated to and used with Hugging Face's visual language models?

How can this library be migrated to and used with Hugging Face's visual language models, for example Salesforce/blip-vqa-capfilt-large or xtuner/llava-internlm2-7b? https://huggingface.co/xtuner/llava-internlm2-7b https://huggingface.co/Salesforce/blip-vqa-capfilt-large

Thank you so much!

jiabao-wang avatar Feb 01 '24 04:02 jiabao-wang

Hello! I've forked the repo and added LLaVA support. It works as follows (a rough sketch is given after the list):

  • Encode the image with the vision tower (CLIP) to get image features
  • Project them into the LLM's embedding space
  • Get the embeddings of the text tokens
  • Merge the text embeddings and the patch embeddings (I found a function in LlavaForConditionalGeneration that does this).
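Those four steps look roughly like the sketch below. This is only an illustration, not the fork's actual implementation: the checkpoint llava-hf/llava-1.5-7b-hf, the prompt format, and the manual splicing at the `<image>` token position are assumptions made for the example (the real merging helper in LlavaForConditionalGeneration also handles padding and multiple images).

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Hypothetical checkpoint chosen for illustration; the thread does not pin one down.
model_id = "llava-hf/llava-1.5-7b-hf"
llava = LlavaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # any local image
prompt = "USER: <image>\nWhat is shown in this picture? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt")

with torch.no_grad():
    # 1. Encode the image with the vision tower (CLIP) and take the configured hidden layer.
    vision_out = llava.vision_tower(inputs["pixel_values"], output_hidden_states=True)
    image_feats = vision_out.hidden_states[llava.config.vision_feature_layer]
    image_feats = image_feats[:, 1:]  # drop the CLS token (the default "patch" selection)

    # 2. Project the patch features into the LLM's embedding space.
    image_embeds = llava.multi_modal_projector(image_feats)

    # 3. Embed the text tokens with the language model's embedding matrix.
    text_embeds = llava.get_input_embeddings()(inputs["input_ids"])

    # 4. Merge: splice the projected patch embeddings in at the <image> token position.
    #    (Done by hand here only to keep the sketch self-contained.)
    pos = (inputs["input_ids"][0] == llava.config.image_token_index).nonzero()[0].item()
    merged_embeds = torch.cat(
        [text_embeds[:, :pos], image_embeds, text_embeds[:, pos + 1:]], dim=1
    )
```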

Then we load Vicuna from LLaVA as a HookedTransformer, pass the merged embeddings to it, and get a cache with the activations and attention matrices.
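A sketch of that last step, assuming the fork (or a compatible TransformerLens version) can map the LLaVA language model onto an existing Vicuna config; the config name "lmsys/vicuna-7b-v1.5" and the use of `hf_model` / `start_at_layer` here are assumptions for illustration, not necessarily the fork's exact API:

```python
from transformer_lens import HookedTransformer

# Load the LLM half of LLaVA (Vicuna for llava-1.5) as a HookedTransformer,
# reusing the already-loaded fine-tuned weights via hf_model. The config name
# is an assumption; the fork may wire this up differently.
hooked = HookedTransformer.from_pretrained(
    "lmsys/vicuna-7b-v1.5",
    hf_model=llava.language_model,
    tokenizer=processor.tokenizer,
)

# Skip the embedding layer and feed the merged (text + image) embeddings straight
# into the transformer blocks, caching all intermediate activations.
logits, cache = hooked.run_with_cache(merged_embeds, start_at_layer=0)

attn_patterns_layer0 = cache["pattern", 0]  # attention matrices of layer 0
resid_final = cache["resid_post", hooked.cfg.n_layers - 1]  # final residual stream
```

From the cache you can then read off attention patterns or residual-stream activations at the image-patch positions just as you would for a text-only prompt.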

I recently made a pull request that enables all of the above; those or similar changes may land on the project's main branch at some point, but for now you can clone my fork (a small demonstration is included in demos/LLaVA.ipynb). Here is the link: https://github.com/zazamrykh/TransformerLens/tree/feature-llava-support

zazamrykh avatar Dec 19 '24 10:12 zazamrykh