InternVL CLIP Compatibility
Hello, hope the authors are doing well. I am wondering whether there is a way to get an InternVL model to function as a standard CLIP vision tower. I am currently working on a project using the LLaVA codebase and have been unsuccessful in retrofitting it for InternVL.
It seems I have found a similar discussion in issue #83, so I will close this issue and continue the discussion there.
Hi, you can use InternViT-6B as a standard CLIP vision tower.
Here is the sample code where I integrated InternViT-6B into LLaVA's codebase. Note that this version of the code is not up to date; if you need to use the latest LLaVA code, you will have to do some migration work.
https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat_llava/llava/model/multimodal_encoder/clip_encoder.py#L51-L57
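For anyone skimming this thread, the general idea is to wrap the InternViT encoder behind the same interface LLaVA expects from its `CLIPVisionTower` (a `forward` that maps images to patch features). The sketch below is illustrative only, not the code from the linked file: the class name `InternVisionTower`, the `select_layer` parameter, and the assumption that the encoder returns `hidden_states` when asked are all assumptions on my part.

```python
import torch
import torch.nn as nn


class InternVisionTower(nn.Module):
    """Hypothetical wrapper exposing an InternViT-style encoder through
    the interface LLaVA's vision tower expects (images -> patch features)."""

    def __init__(self, vision_model, select_layer=-1):
        super().__init__()
        self.vision_model = vision_model
        # Which hidden-state layer to take features from (e.g. -1 = last,
        # -2 = penultimate, as LLaVA commonly uses for CLIP).
        self.select_layer = select_layer

    @torch.no_grad()
    def forward(self, images):
        # Ask the encoder for all hidden states and select one layer.
        outputs = self.vision_model(images, output_hidden_states=True)
        features = outputs.hidden_states[self.select_layer]
        # Drop the leading [CLS] token so only patch tokens remain,
        # shape: (batch, num_patches, hidden_dim).
        return features[:, 1:]
```

The real InternViT-6B checkpoint would be loaded separately (e.g. via `transformers` with `trust_remote_code=True`) and passed in as `vision_model`; the wrapper itself only handles layer selection and token slicing.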
This is extremely helpful, and I really appreciate the effort. I will test it and let you know if there are any issues. Thank you so much 😀
Edit: it turns out this was already in the repo, but thank you all the same 😀
Update: the same error keeps popping up. I'm leaving this issue open as a personal note to submit a more detailed error log. The TL;DR is that it claims `LlavaLlamaForCausalLM` has not been imported, and that `llava.model` doesn't have that attribute.
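In my experience this kind of `AttributeError` often hides an earlier import failure: if a package's `__init__.py` wraps its imports in a `try/except`, the real error (commonly a `transformers` or CUDA-extension version mismatch) gets swallowed, and all you see later is "module has no attribute ...". A generic way to surface the underlying traceback is to import the attribute directly and print the full exception. The helper below is a sketch of that idea, not part of LLaVA itself:

```python
import importlib
import traceback


def diagnose_import(module_path, attr):
    """Try to fetch `attr` from `module_path`, printing the real
    underlying traceback instead of a bare AttributeError."""
    try:
        mod = importlib.import_module(module_path)
        return getattr(mod, attr)
    except Exception:
        # Show the root cause (e.g. a failed transformers import)
        # rather than the misleading missing-attribute symptom.
        traceback.print_exc()
        return None
```

For this issue you would call it on the submodule that actually defines the class (the exact path depends on your LLaVA version), and the printed traceback should point at the real culprit.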
May I ask if this issue has been resolved?