InternVL: question about the InternVL-Chat-ViT-6B-Vicuna-7B model weights

I have two questions:

In https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B/blob/main/pytorch_model.bin.index.json, the weight model.vision_tower.vision_tower.embeddings.position_embedding has shape 1x577x32000, but the corresponding weight in InternViT-6B-224px has shape 1x257x32000. I see that you removed resize_pos_embeddings in this change: https://github.com/OpenGVLab/InternVL/commit/c82d6ce30f512b33c58615088233a263112ae727. Following the training logic in the latest code, this size should not become 577. Has the model on HF not been updated? (Was the model on HF trained with tune_vit_pos_embedding enabled? That value now seems to be False everywhere.)
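
The two sequence lengths can be checked with a little arithmetic. This is a minimal sketch that assumes a patch size of 14 and a single class token, neither of which is stated in this thread; `num_position_tokens` is a hypothetical helper, not a function from the repository.

```python
def num_position_tokens(image_size: int, patch_size: int = 14,
                        num_class_tokens: int = 1) -> int:
    """Length of a ViT position embedding for a square input image.

    patch_size=14 and num_class_tokens=1 are assumptions for illustration.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    grid = image_size // patch_size  # patches per side
    return grid * grid + num_class_tokens


for size in (224, 336):
    print(size, num_position_tokens(size))
```

Under these assumptions, 257 tokens correspond to 224 x 224 inputs and 577 tokens to 336 x 336 inputs.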

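Following LLaVA's loading logic (load_pretrained_model), LlavaMetaModel builds the vision_tower with delay_load=True, so the vision_tower is loaded last and the vision_tower weights stored in the LLM checkpoint are ignored. If tune_vit_pos_embedding was used, this logic looks wrong.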

Apr 10 '24 08:04 irexyc

The old resize operation did not support DeepSpeed ZeRO-3, so we changed it. The position embedding is now resized dynamically during ViT inference, according to the size of the feature map.
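
The idea of resizing at inference time can be sketched without any dependencies. The code below is only an illustration under assumptions: it uses corner-aligned bilinear interpolation on plain Python lists, and `resize_pos_embedding` is a hypothetical name. The repository's actual implementation and interpolation mode are not shown in this thread.

```python
import math
from typing import List

Vector = List[float]


def resize_pos_embedding(pos_embed: List[Vector], new_grid: int) -> List[Vector]:
    """Resize a [1 + g*g, dim] position embedding to [1 + new_grid**2, dim].

    Row 0 is treated as the class-token embedding and copied unchanged.
    Patch rows are resampled with corner-aligned bilinear interpolation.
    """
    if new_grid < 1:
        raise ValueError("new_grid must be positive")
    cls_row, patches = pos_embed[0], pos_embed[1:]
    old_grid = math.isqrt(len(patches))
    if old_grid < 1 or old_grid * old_grid != len(patches):
        raise ValueError("patch embeddings do not form a square grid")
    dim = len(cls_row)
    # Factor mapping an output index to a fractional input coordinate.
    scale = (old_grid - 1) / (new_grid - 1) if new_grid > 1 else 0.0

    def split(index: int):
        pos = index * scale
        lo = min(int(pos), old_grid - 1)
        hi = min(lo + 1, old_grid - 1)
        return lo, hi, pos - lo  # two neighbours and the weight of `hi`

    out = [list(cls_row)]
    for i in range(new_grid):
        y0, y1, wy = split(i)
        for j in range(new_grid):
            x0, x1, wx = split(j)
            a, b = patches[y0 * old_grid + x0], patches[y0 * old_grid + x1]
            c, d = patches[y1 * old_grid + x0], patches[y1 * old_grid + x1]
            out.append([
                (1 - wy) * ((1 - wx) * a[k] + wx * b[k])
                + wy * ((1 - wx) * c[k] + wx * d[k])
                for k in range(dim)
            ])
    return out


# Toy embedding: each patch stores its own (row, col) coordinate, dim = 2.
old = [[0.0, 0.0]] + [[float(r), float(c)] for r in range(24) for c in range(24)]
new = resize_pos_embedding(old, 16)
print(len(old), "->", len(new))  # 577 -> 257
```

With a resize of this kind applied inside the forward pass, a 577-token embedding can serve a 16 x 16 feature map, and resizing never has to modify the stored parameter.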

Apr 16 '24 15:04 czczup

Do you mean that testing the old model with the current code has a bug? I will check.

Apr 16 '24 15:04 czczup

If it is convenient, could you share the inference code for OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B?

Apr 16 '24 16:04 irexyc

https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5
There is inference code here. How do I use it from there? @irexyc

Apr 21 '24 05:04 sunjunlishi

@sunjunlishi

That is the ViT code, isn't it? What I want to know is which inference code is used for https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B.

If the model is loaded with llava.serve.model_worker, loading still goes through the load_pretrained_model function.

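For InternVL-Chat-ViT-6B-Vicuna-7B, I think this is a problem. The ViT weights inside InternVL-Chat-ViT-6B-Vicuna-7B do not match those of InternViT-6B-224px in position_embedding. It looks like the ones inside InternVL-Chat-ViT-6B-Vicuna-7B should be used. With this function's logic, however, the vision_tower is delay-loaded while the LLM is loaded, and when the vision_tower is finally loaded, both the vision tower and the image preprocessor actually come from InternViT-6B-224px.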
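
The loading order described above can be illustrated with a toy model. Everything in this sketch is hypothetical: plain dictionaries stand in for state dicts, shapes are stored as tuples instead of tensors (the hidden size 8 is a placeholder), and `load_with_delayed_vision_tower` is not the repository's load_pretrained_model.

```python
VISION_PREFIX = "model.vision_tower."


def load_with_delayed_vision_tower(chat_checkpoint: dict,
                                   vision_checkpoint: dict) -> dict:
    """Toy version of a delay-load flow (hypothetical, for illustration)."""
    model = {}
    # Step 1: load the chat checkpoint, ignoring its vision tower entries.
    for key, value in chat_checkpoint.items():
        if not key.startswith(VISION_PREFIX):
            model[key] = value
    # Step 2: build the vision tower last, from its own pretrained weights.
    for key, value in vision_checkpoint.items():
        model[VISION_PREFIX + key] = value
    return model


POS_KEY = "model.vision_tower.vision_tower.embeddings.position_embedding"

chat_checkpoint = {
    "model.mm_projector.weight": (16, 8),
    POS_KEY: (1, 577, 8),  # embedding stored with the chat model
}
vision_checkpoint = {
    "vision_tower.embeddings.position_embedding": (1, 257, 8),  # base ViT
}

model = load_with_delayed_vision_tower(chat_checkpoint, vision_checkpoint)
print(model[POS_KEY])  # (1, 257, 8): the 577-token embedding was dropped
```

In this toy flow the projector comes from the chat checkpoint while the position embedding comes from the base ViT, which is the mismatch in question.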

Apr 22 '24 02:04 irexyc

@irexyc Take a look at this: someone made a 4-bit quantization of version 1.5, but I could not get it to run.

https://huggingface.co/failspy/InternVL-Chat-V1-5-4bit

Apr 23 '24 11:04 sunjunlishi

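@irexyc For OCR, I currently run OCR recognition first and then pass the recognized text to the multimodal model as context. That also gives fairly good results.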

Apr 26 '24 05:04 sunjunlishi

Today I tested two models, InternVL-Chat-ViT-6B-Vicuna-7B and InternVL-Chat-ViT-6B-Vicuna-13B. Both run normally with the evaluation scripts in this directory: https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat_llava/scripts_internvl/eval

Apr 27 '24 13:04 czczup

Thanks.

Apr 28 '24 02:04 sunjunlishi

@czczup

Which script exactly?

As I said above, take https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat_llava/scripts_internvl/eval/mmbench.sh as an example. The model is loaded by this line: https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat_llava/llava/eval/model_vqa_mmbench.py#L11

The function that loads the model does not explicitly call resize_pos_embeddings on the vision_tower: https://github.com/OpenGVLab/InternVL/blob/0f76e06fd202cbd60fa338afbb51167fb3eda7de/internvl_chat_llava/llava/model/builder.py#L137-L141

However, in the InternVL-Chat-ViT-6B-Vicuna-7B weights, the ViT weights differ from the embeddings layer of OpenGVLab/InternViT-6B-224px (compare the size of embeddings.position_embedding). Isn't that a problem?

Apr 28 '24 03:04 irexyc

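The weights on Hugging Face are the old ones and have not been retrained, but the tests still run and produce results normally. What exactly do you mean by a problem? Does it raise an error?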

Are you integrating this model into lmdeploy? If there are problems, I can help resolve them.

Apr 28 '24 06:04 czczup

I don't mean that it raises an error. I mean that I think your code is wrong, or incomplete.

Judging from the current code, the preprocessing and vision tower that belong to InternVL-Chat-ViT-6B-Vicuna-7B work at (336, 336).

The preprocessing and vision tower of OpenGVLab/InternViT-6B-224px, on the other hand, work at (224 x 224). Loading with the load_pretrained_model function ends up with the (224 x 224) ones.

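The weights of mm_projector and vision_tower should match each other, shouldn't they? According to the weights on Hugging Face, the vision tower input size is 336 x 336, but the current code loads a preprocessor and vision tower for 224 x 224. Isn't that a problem? (In effect, the encoder and projector were trained on 336 x 336 inputs, but inference uses 224 x 224 inputs.)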
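
A mismatch like this could be caught with a small guard at load time. This is a minimal sketch, assuming a patch size of 14 and one class token (neither is stated in this thread); `infer_image_size` and `check_resolution` are hypothetical helpers, not part of InternVL or lmdeploy.

```python
import math


def infer_image_size(num_position_tokens: int, patch_size: int = 14,
                     num_class_tokens: int = 1) -> int:
    """Recover the square input size implied by a position embedding length."""
    patches = num_position_tokens - num_class_tokens
    if patches < 1:
        raise ValueError("position embedding has no patch tokens")
    grid = math.isqrt(patches)
    if grid * grid != patches:
        raise ValueError("patch tokens do not form a square grid")
    return grid * patch_size


def check_resolution(num_position_tokens: int, preprocessor_size: int,
                     patch_size: int = 14) -> int:
    """Raise if the preprocessor resolution differs from the checkpoint's."""
    trained = infer_image_size(num_position_tokens, patch_size)
    if trained != preprocessor_size:
        raise ValueError(
            f"checkpoint implies {trained}x{trained} inputs, "
            f"but the preprocessor produces "
            f"{preprocessor_size}x{preprocessor_size}"
        )
    return trained


print(infer_image_size(577))  # 336
print(infer_image_size(257))  # 224
try:
    check_resolution(577, 224)
except ValueError as err:
    print("mismatch:", err)
```

A check of this kind only reports the inconsistency. Which checkpoint should supply the vision tower is a separate decision.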

lmdeploy already supports several of the InternVL-Chat models. For this part of the logic, lmdeploy was changed like this: https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/vl/model/internvl_llava.py#L65-L80

Apr 28 '24 06:04 irexyc

OK, thanks. I will fix this problem.

Apr 28 '24 07:04 czczup