Zhe Chen comments

Results: 316 comments of Zhe Chen

Running inference on a V100 fails with No module named 'transformers_modules.InternVL-Chat-V1'

Hello, this is most likely because the file name contains a dot. Please write 1.5 as 1_5.
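The rename suggested above can be sketched as follows. This is a minimal illustration, assuming the model was downloaded to a local folder whose name contains a dot (which Python's import machinery reads as a package separator when the remote code is loaded). The helper names `safe_model_dir_name` and `rename_model_dir` are hypothetical and not part of any library.

```python
from pathlib import Path


def safe_model_dir_name(name: str) -> str:
    """Replace dots with underscores so the folder name is a valid module name part."""
    return name.replace(".", "_")


def rename_model_dir(path: str) -> Path:
    """Rename a local model folder in place and return the new path.

    Hypothetical helper: only the last path component is changed.
    """
    src = Path(path)
    dst = src.with_name(safe_model_dir_name(src.name))
    if dst != src:
        src.rename(dst)
    return dst


if __name__ == "__main__":
    # Only the name is computed here; nothing on disk is touched.
    print(safe_model_dir_name("InternVL-Chat-V1.5"))
```

After renaming, the new folder path is what gets passed to `from_pretrained`.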

How to call these two models directly: InternVL-G/C

Hello, please refer to this code to call InternVL-C and InternVL-G: https://huggingface.co/OpenGVLab/InternVL-14B-224px#model-usage

```
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
from transformers import AutoTokenizer

model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL-14B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternVL-14B-224px')

tokenizer = AutoTokenizer.from_pretrained(
    'OpenGVLab/InternVL-14B-224px',...
```

Image transformation for InternVL-1.5

Hello, the gradio_web_server does apply the image transformation as well; the code is just slightly different. There, the transformation is performed by CLIPImageProcessor, which is actually...

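Quick question: did the model use the UHD-like image tiling during pretraining?

Yes, we already cut images into 12 tiles during pretraining. From our experiments, using the same tiling strategy in pretraining and finetuning gives the best performance. If you do not tile during pretraining and only tile during finetuning, performance drops by 1-2 points.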

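Quick question: did the model use the UHD-like image tiling during pretraining?

> > Yes, we already cut images into 12 tiles during pretraining. From our experiments, using the same tiling strategy in pretraining and finetuning gives the best performance. If you do not tile during pretraining and only tile during finetuning, performance drops by 1-2 points.
>
> Thanks for sharing, you really have plenty of GPUs (lol). Also, why did you switch from Yi-34B back to InternLM2-20B? By the paper's reasoning, a larger LLM should pair better with InternViT-6B, shouldn't it? And on some other data, Yi-34B does perform better than the 20B.

Yi-34B does perform well. The new 40B model we trained scores much higher than the currently open-sourced 26B one, gaining several points on every dataset. It is just so large that probably few people could run it, so we have not released it yet.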

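Quick question: did the model use the UHD-like image tiling during pretraining?

> > Yes, we already cut images into 12 tiles during pretraining. From our experiments, using the same tiling strategy in pretraining and finetuning gives the best performance. If you do not tile during pretraining and only tile during finetuning, performance drops by 1-2 points.
>
> One more thing the paper does not spell out: during training, what happens if an image is too small to be cut into 12 tiles? Is it padded with zeros?

Training uses dynamic resolution, so anywhere from 1 to 12 tiles is fine. However many tiles an image is cut into, that many tiles are used for training; we do not force-pad to 12 tiles.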
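The dynamic tiling described above (1 to 12 tiles, no padding up to 12) can be sketched roughly as follows. This is a simplified illustration, not the project's actual preprocessing code. It assumes the grid is chosen by closest aspect ratio, that ties go to the smaller grid, and that tiles are 448 pixels square; none of these details are stated in this thread. The function names are hypothetical.

```python
def candidate_grids(max_tiles=12):
    """All (cols, rows) grids with at most max_tiles tiles, smallest grids first."""
    grids = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]
    return sorted(grids, key=lambda g: (g[0] * g[1], g[0]))


def choose_grid(width, height, max_tiles=12):
    """Pick the grid whose aspect ratio is closest to the image's (assumed rule)."""
    aspect = width / height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in candidate_grids(max_tiles):
        diff = abs(aspect - cols / rows)
        if diff < best_diff:  # strict: on ties the smaller grid is kept
            best, best_diff = (cols, rows), diff
    return best


def tile_boxes(width, height, tile_size=448, max_tiles=12):
    """Crop boxes (left, top, right, bottom) on the image after it has been
    resized to (cols * tile_size, rows * tile_size). tile_size is an assumption."""
    cols, rows = choose_grid(width, height, max_tiles)
    return [
        (c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    for size in [(448, 448), (1344, 448), (800, 1200)]:
        print(size, choose_grid(*size), len(tile_boxes(*size)))
```

The point of the sketch is that the number of tiles follows from the image itself, so a small or square image yields a single tile instead of being padded to 12.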

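Quick question: did the model use the UHD-like image tiling during pretraining?

> > > > Yes, we already cut images into 12 tiles during pretraining. From our experiments, using the same tiling strategy in pretraining and finetuning gives the best performance. If you do not tile during pretraining and only tile during finetuning, performance drops by 1-2 points.
> > >
> > > One more thing the paper does not spell out: during training, what happens if an image is too small to be cut into 12 tiles? Is it padded with zeros?
> >
> > Training uses dynamic resolution, so anywhere from 1 to 12 tiles is fine. However many tiles an image is cut into, that many tiles are used for training; we do not force-pad to 12 tiles.
>
> 👌 Last question: an earlier paper argued that, between the base and chat versions of an LLM, the chat version is better suited to MLLM training than the base version. I see your paper stresses that you use the chat version; is that also because it gave better experimental results?

My impression is that most people use chat models for multimodal training. I have tried comparing a base model against a chat model, and the chat model got higher benchmark scores.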

--freeze_backbone False?

Hello, this is the fine-tuning script. When we fine-tune, we unfreeze the entire model and train all of it.

--freeze_backbone False?

Yes, in my experiments, training the vision encoder was significantly better than freezing it, so in all recent experiments I have kept the vision encoder trainable during finetuning...
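As a rough illustration of what a flag like `--freeze_backbone False` usually amounts to, here is a minimal sketch. It assumes the flag simply toggles `requires_grad` on the vision encoder's parameters; how the actual training script handles it is not shown in this thread. The function names are hypothetical, and plain stand-in objects are used so the snippet runs without PyTorch. With PyTorch, you would pass the vision encoder's `.parameters()` instead.

```python
from types import SimpleNamespace


def set_trainable(parameters, trainable):
    """Enable or disable gradient updates for each parameter; return how many were touched."""
    count = 0
    for param in parameters:
        param.requires_grad = trainable
        count += 1
    return count


def apply_freeze_backbone(backbone_parameters, freeze_backbone):
    """freeze_backbone=False means the backbone (vision encoder) is trained."""
    return set_trainable(backbone_parameters, trainable=not freeze_backbone)


if __name__ == "__main__":
    # Stand-ins for real parameters: anything with a requires_grad attribute.
    vision_params = [SimpleNamespace(requires_grad=False) for _ in range(3)]
    apply_freeze_backbone(vision_params, freeze_backbone=False)
    print([p.requires_grad for p in vision_params])
```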