InternVL issues

Training the 1B model on a 32 GB V100 GPU: flash_attention is not supported. Has anyone trained the 1B model on a V100? A100s are expensive

4

Addition: my task is simple single-image classification. I found that the 1B model outperforms CLIP by a large margin, so I want to train the 1B model on a V100.

CLIsVeryOK
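
A minimal sketch of how a training script might choose settings by GPU generation. It relies on two facts: FlashAttention-2 kernels target Ampere (compute capability 8.x) and newer, while the V100 is Volta (7.0) and has no native bf16 support. The function name and the returned option strings are illustrative assumptions, not InternVL's actual configuration keys. With PyTorch installed, the capability tuple can be obtained from `torch.cuda.get_device_capability()`.

```python
def pick_training_settings(compute_capability):
    """Suggest attention backend and dtype for a GPU's (major, minor) capability.

    The returned names are hypothetical labels for illustration; map them to
    whatever options the training code in use actually exposes.
    """
    major, _minor = compute_capability
    if major >= 8:
        # Ampere or newer: FlashAttention-2 and bf16 are both available.
        return {"attention": "flash_attention_2", "dtype": "bf16"}
    # Volta (V100, 7.0) and older: fall back to a standard attention
    # implementation and fp16, since bf16 is not natively supported.
    return {"attention": "eager", "dtype": "fp16"}


if __name__ == "__main__":
    print("V100:", pick_training_settings((7, 0)))
    print("A100:", pick_training_settings((8, 0)))
```

The fallback path trades speed and memory for compatibility, so a smaller batch size or gradient checkpointing may still be needed on 32 GB.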

How can I determine which region of the image the model is focusing on when answering a specific question?

2

How can I determine which region of the image the model is focusing on when answering a specific question? Does InternVL use cross-attention between images and text? If so, how...

phongkhanh

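Pre-training context length question

Hi, with num_images_expected=48 and max_packed_tokens=32000 the actual sequence length is about 16k and training runs fine. With num_images_expected=100 and max_packed_tokens=32000, however, it fails with a GPU out-of-memory error. The hardware is 8 * 8 * 80G GPUs, so in principle this should work (text-only data runs fine). What needs to change in the code to support 32k? Thanks.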

samaritan1998
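
A rough back-of-the-envelope sketch of how the image count feeds into the packed sequence length. It assumes each image contributes a fixed 256 visual tokens (one tile per image); that figure and both helper names are assumptions for illustration, and the real count depends on tiling and the data pipeline.

```python
def estimate_packed_tokens(num_images, tokens_per_image=256, text_tokens=0):
    """Estimate the token count of one packed sample (assumed fixed cost per image)."""
    return num_images * tokens_per_image + text_tokens


def fits_token_budget(num_images, max_packed_tokens, tokens_per_image=256, text_tokens=0):
    """Check whether the estimate stays within max_packed_tokens."""
    total = estimate_packed_tokens(num_images, tokens_per_image, text_tokens)
    return total <= max_packed_tokens


if __name__ == "__main__":
    for n in (48, 100):
        visual = estimate_packed_tokens(n)
        print(n, "images ->", visual, "visual tokens,",
              "fits 32000:", fits_token_budget(n, 32000))
```

Staying within the token budget does not by itself guarantee the sample fits in memory: activation memory grows with sequence length, and the vision encoder also has to process every image in the sample.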

[Bug] pad_id of 0 in concat_pad_data_collator may be a problem

### Checklist - [x] 1. I have searched related issues but cannot get the expected help. - [ ] 2. The bug has not been fixed in the latest version....

MrChen314
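
The issue body above is truncated, so the following only illustrates one general way a pad id of 0 can matter; it is not necessarily what the reporter observed. If token id 0 is also a real vocabulary entry, a mask derived by comparing `input_ids` with the pad id hides genuine tokens, whereas a mask built from sequence lengths does not. `pad_batch` and `mask_from_ids` are hypothetical helpers, not InternVL's `concat_pad_data_collator`.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_batch(sequences, pad_id):
    """Right-pad token sequences; the attention mask comes from lengths, not ids."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "labels": [], "attention_mask": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(list(seq) + [pad_id] * n_pad)
        # Padded positions get IGNORE_INDEX so they do not contribute to the loss.
        batch["labels"].append(list(seq) + [IGNORE_INDEX] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
    return batch


def mask_from_ids(input_ids, pad_id):
    """Fragile alternative: infer the mask by comparing ids with pad_id."""
    return [[int(tok != pad_id) for tok in row] for row in input_ids]


if __name__ == "__main__":
    # The first sequence contains a real token whose id happens to be 0.
    batch = pad_batch([[5, 0, 7], [3]], pad_id=0)
    print("length-based mask:", batch["attention_mask"])
    print("id-based mask:    ", mask_from_ids(batch["input_ids"], pad_id=0))
```

In the demo the id-based mask drops the middle token of the first sequence, while the length-based mask keeps it.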

[Bug] [Errno 2] No such file or directory eval/mmmu/evaluate_mmmu_cot.py

2

### Checklist - [x] 1. I have searched related issues but cannot get the expected help. - [ ] 2. The bug has not been fixed in the latest version....

zzk6626

[Docs] For integrating

3

### 📚 The doc issue Is there any tutorial for integrating the vision model with the language model? ### Suggest a potential alternative/fix _No response_

Hert4

Fine-tune the 78B model on a multi-node Slurm system

2

Hi, I would like to fine-tune on a multi-node GPU system. Each node has 8 A100s, and the system uses Slurm. I am not sure if the fine-tune command below...

spcrobocar
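
As a sketch of the bookkeeping involved, the snippet below derives per-node launcher values from Slurm's `SLURM_NNODES` and `SLURM_NODEID` environment variables. The function name and the default of 8 GPUs per node are assumptions taken from the setup described above; the returned keys mirror torchrun's `--nnodes`, `--node_rank`, and `--nproc_per_node` options, and the actual fine-tuning command may expect something different.

```python
import os


def distributed_env_from_slurm(env=None, gpus_per_node=8):
    """Derive per-node launcher settings from Slurm environment variables."""
    env = os.environ if env is None else env
    nnodes = int(env["SLURM_NNODES"])     # number of nodes allocated to the job
    node_rank = int(env["SLURM_NODEID"])  # index of this node within the job
    return {
        "nnodes": nnodes,
        "node_rank": node_rank,
        "nproc_per_node": gpus_per_node,
        "world_size": nnodes * gpus_per_node,
    }


if __name__ == "__main__":
    # Simulated environment for the third node of a four-node job.
    fake_env = {"SLURM_NNODES": "4", "SLURM_NODEID": "2"}
    print(distributed_env_from_slurm(fake_env))
```

The rendezvous address is usually taken from the first hostname in the job's node list, which has to be resolved separately.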

Problem with multi-category grounding in InternVL

2

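Hello, I built question-answer data in the V3Det format following the official documentation, and training and inference run normally. When an image contains only one category of object, detection mostly works: whether there is one instance or several, most are found (for example, one person or five people are all detected). When an image contains multiple categories, however, only one category is detected. For example, with training data that covers people and cars, an image with two people and two cars yields all the objects of only one of the two categories. Training was fairly thorough and the dataset is not small. A training sample looks like this: ``{"id": 24770, "image": "train/1707221130212.jpg", "width": 1600, "height": 900, "conversations": [{"from": "human", "value": "\n请检测下图中的所有目标并标记坐标位置"}, {"from": "gpt", "value": "道路上停放的车辆[[0,390,170,754]]\n道路上出现的人[[31,665,99,740],[95,667,141,727],[0,397,168,761]]\n"}]}`` (The prompt reads "Please detect all objects in the image below and mark their coordinates"; the two labels are "vehicles parked on the road" and "people on the road".) This problem has bothered me for a long time. I have tried many question-answer structures and none of them solves multi-category detection. What could be the cause? I have considered the following possibilities: 1: the training code only reads the boxes of the first category; 2: the training loss function; 3: the model's output.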

daihuidai
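
One way to narrow down the three hypotheses is to parse both the training answers and the model's raw outputs into per-category boxes and compare them. The sketch below assumes the answer format shown in the sample (one `label[[x1,y1,x2,y2],...]` entry per line); `parse_detection_answer` is a hypothetical helper, not part of InternVL.

```python
import json
import re

# A label followed by a JSON-style list of boxes, e.g. label[[x1,y1,x2,y2],...]
LINE_PATTERN = re.compile(r"^(?P<label>.+?)(?P<boxes>\[\[.*\]\])\s*$")


def parse_detection_answer(answer):
    """Map each category label to its list of [x1, y1, x2, y2] boxes."""
    result = {}
    for line in answer.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # skip blank or malformed lines
        boxes = json.loads(match.group("boxes"))
        result.setdefault(match.group("label"), []).extend(boxes)
    return result


if __name__ == "__main__":
    sample = (
        "道路上停放的车辆[[0,390,170,754]]\n"
        "道路上出现的人[[31,665,99,740],[95,667,141,727],[0,397,168,761]]\n"
    )
    for label, boxes in parse_detection_answer(sample).items():
        print(label, len(boxes), "box(es)")
```

If the parsed training targets contain every category but the parsed generations stop after the first one, the data is likely fine, and the generation settings (for example the maximum number of new tokens) or the model output would be the next thing to inspect.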

About Router training in InternVL3.5

7

Does Router training determine the routing purely by computing metrics? Does it involve no parameter training?

JerryHUZY

Guidance on Creating Data and Training Intern VL Model on Unsloth

Hi team, I’m a new intern working on the VL project in `unsloth`. I have read through the docs here: [Video Data Format - Intern VL](https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html#video-data) but I couldn’t find...

dauvannam1804
