InternVL CLIP Compatibility
Hello, hope the authors are doing well. I am wondering whether there is a way to get an InternVL model to function as a standard CLIP vision tower. I am currently working on a project using the LLaVA codebase and have been unsuccessful in retrofitting it for InternVL.
It seems I have found a similar discussion in issue #83, so I will close this issue and continue the discussion there.
Hi, you can use InternViT-6B as a standard CLIP vision tower.
Here is the sample code where I integrated InternViT-6B into LLaVA's codebase. Note that this version of the code is not up to date; if you need to use the latest LLaVA code, you will have to do some migration work.
https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat_llava/llava/model/multimodal_encoder/clip_encoder.py#L51-L57
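For anyone skimming this thread, the general idea is to wrap the InternViT encoder behind the same interface LLaVA expects from its `CLIPVisionTower` (a `forward` that maps images to patch features). The sketch below is illustrative only, not the code from the linked file: the class name `InternVisionTower`, the `select_layer` parameter, and the assumption that the encoder returns `hidden_states` when asked are all assumptions on my part.

```python
import torch
import torch.nn as nn


class InternVisionTower(nn.Module):
    """Hypothetical wrapper exposing an InternViT-style encoder through
    the interface LLaVA's vision tower expects (images -> patch features)."""

    def __init__(self, vision_model, select_layer=-1):
        super().__init__()
        self.vision_model = vision_model
        # Which hidden-state layer to take features from (e.g. -1 = last,
        # -2 = penultimate, as LLaVA commonly uses for CLIP).
        self.select_layer = select_layer

    @torch.no_grad()
    def forward(self, images):
        # Ask the encoder for all hidden states and select one layer.
        outputs = self.vision_model(images, output_hidden_states=True)
        features = outputs.hidden_states[self.select_layer]
        # Drop the leading [CLS] token so only patch tokens remain,
        # shape: (batch, num_patches, hidden_dim).
        return features[:, 1:]
```

The real InternViT-6B checkpoint would be loaded separately (e.g. via `transformers` with `trust_remote_code=True`) and passed in as `vision_model`; the wrapper itself only handles layer selection and token slicing.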
This is extremely helpful, and I really appreciate the effort. I will test it and let you know if there are any issues. Thank you so much 😀
Edit: it turns out this was already in the repo, but thank you all the same 😀
Update: the same error keeps popping up. I'm leaving this issue open as a personal note to submit a more detailed error log. The TL;DR is that it claims `LlavaLlamaForCausalLM` has not been imported, and that `llava.model` doesn't have that attribute.
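In my experience this kind of `AttributeError` often hides an earlier import failure: if a package's `__init__.py` wraps its imports in a `try/except`, the real error (commonly a `transformers` or CUDA-extension version mismatch) gets swallowed, and all you see later is "module has no attribute ...". A generic way to surface the underlying traceback is to import the attribute directly and print the full exception. The helper below is a sketch of that idea, not part of LLaVA itself:

```python
import importlib
import traceback


def diagnose_import(module_path, attr):
    """Try to fetch `attr` from `module_path`, printing the real
    underlying traceback instead of a bare AttributeError."""
    try:
        mod = importlib.import_module(module_path)
        return getattr(mod, attr)
    except Exception:
        # Show the root cause (e.g. a failed transformers import)
        # rather than the misleading missing-attribute symptom.
        traceback.print_exc()
        return None
```

For this issue you would call it on the submodule that actually defines the class (the exact path depends on your LLaVA version), and the printed traceback should point at the real culprit.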
May I ask if this issue has been resolved?