InternVL
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o performance.
Addition: my task is a simple single-image classification. I find that the 1B model outperforms CLIP by a large margin, so I want to train the 1B model on a V100.
How can I determine which region of the image the model is focusing on when answering a specific question? Does InternVL use cross-attention between images and text? If so, how...
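Hi: with num_images_expected=48 and max_packed_tokens=32000, the actual sequence length is about 16k and training runs fine. But with num_images_expected=100 and max_packed_tokens=32000, it fails to run (out of GPU memory). GPU memory is 8 * 8 * 80G, so in principle it should work (the text-only case runs). What needs to change in the code to support 32k? Thanks.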
### Checklist - [x] 1. I have searched related issues but cannot get the expected help. - [ ] 2. The bug has not been fixed in the latest version....
### 📚 The doc issue Is there any tutorial for integrating the vision model with the language model? ### Suggest a potential alternative/fix _No response_
Hi, I would like to fine-tune on a multi-node GPU system. Each node has 8 A100s, and the system uses Slurm. I am not sure if the fine-tune command below...
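Hello, following the official documentation I built question-answer data in the V3Det format. Training and inference both run normally. When an image contains only one category of object, detection basically works: most objects of that category are found whether there is one or several, e.g. one person or five people. But when an image contains multiple categories, only one category is detected. For example, with training data covering people and cars, an image with two people and two cars yields all the objects of just one of the two categories. Training was fairly thorough and the dataset is not small. A sample training record looks like this: ``{"id": 24770, "image": "train/1707221130212.jpg", "width": 1600, "height": 900, "conversations": [{"from": "human", "value": "\n请检测下图中的所有目标并标记坐标位置"}, {"from": "gpt", "value": "道路上停放的车辆[[0,390,170,754]]\n道路上出现的人[[31,665,99,740],[95,667,141,727],[0,397,168,761]]\n"}]}`` (the prompt reads "Please detect all objects in the image below and mark their coordinates"; the two labels read "vehicles parked on the road" and "people appearing on the road"). This problem has troubled me for a long time. I have tried many question-answer structures and none of them solves multi-category detection. What could be the cause? I have considered the following possibilities: 1: the training code reads only the boxes of the first category; 2: the training loss function; 3: a problem with the model output.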
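To narrow down where the second category gets lost, one option is to parse both the ground-truth answers and the model's generated answers into per-category box lists and compare them. The following is a minimal sketch, assuming every answer line has the shape `label[[x1,y1,x2,y2],...]` as in the sample record above; the function name `parse_detection_answer` and the regular expression are illustrative and are not part of the InternVL codebase.

```python
import json
import re

# Assumed line shape (taken from the sample record): free-text label
# immediately followed by a JSON-style list of boxes.
LINE_PATTERN = re.compile(r"^(?P<label>.+?)(?P<boxes>\[\[.*\]\])\s*$")


def parse_detection_answer(answer):
    """Split an answer string into {label: [[x1, y1, x2, y2], ...]}."""
    result = {}
    for line in answer.splitlines():
        line = line.strip()
        if not line:
            continue  # skip the empty piece after a trailing newline
        match = LINE_PATTERN.match(line)
        if match is None:
            raise ValueError(f"unrecognized answer line: {line!r}")
        boxes = json.loads(match.group("boxes"))
        result.setdefault(match.group("label"), []).extend(boxes)
    return result


# The "gpt" value from the sample training record.
sample_answer = (
    "道路上停放的车辆[[0,390,170,754]]\n"
    "道路上出现的人[[31,665,99,740],[95,667,141,727],[0,397,168,761]]\n"
)

parsed = parse_detection_answer(sample_answer)
for label, boxes in parsed.items():
    print(label, len(boxes))
```

Running the same parser over generated answers shows whether the model stops after the first label line or emits a second line in a different shape, which helps separate a data-reading problem from a generation problem.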
Does router training decide the routing purely from computed metrics? Does that mean it involves no parameter training?
Hi team, I’m a new intern working on the VL project in `usloth`. I have read through the docs here: [Video Data Format - Intern VL](https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html#video-data) but I couldn’t find...