[Feature] Request for InternVL3 Dataset Release

Open abigcatcat opened this issue 8 months ago • 2 comments

Motivation

Hi, thanks for your great work on this project!

I’m very interested in the dataset you used and would like to know when you are planning to release it. It would be really helpful for reproduction and further research.

Looking forward to your response. Thanks again!

Related resources

No response

Additional context

No response

abigcatcat avatar Apr 15 '25 08:04 abigcatcat

Yes, we are currently working on organizing the dataset and preparing it for release. Thank you for your interest — we appreciate your patience!

Lechatelia avatar Apr 16 '25 15:04 Lechatelia

What kind of data will be released: the pretraining data and the SFT data? @Lechatelia

roadcode avatar Apr 18 '25 07:04 roadcode

@roadcode I wonder whether you plan to release the additional tool-usage dataset that was newly added on top of the InternVL2.5 training set for SFT.

I have another question: what exactly is the SFT dataset used in InternVL2.5? And for the other datasets used in InternVL2.5, are the descriptions in this document correct?

From the tech report (p. 6):

Data. For SFT data, we construct the training corpora based on those used in InternVL2.5 [18] while introducing additional tool usage, 3D scene understanding, GUI operations, scientific diagrams, creative writing, and multimodal reasoning samples. As a result, the number of training samples grows from 16.3M in InternVL2.5 to 21.7M in InternVL3.

Thanks in advance.

youngwanLEE avatar Jun 10 '25 03:06 youngwanLEE