[Feature] Request for InternVL3 Dataset Release
Motivation
Hi, thanks for your great work on this project!
I’m very interested in the dataset you used and would like to know when you are planning to release it. It would be really helpful for reproduction and further research.
Looking forward to your response. Thanks again!
Yes, we are currently working on organizing the dataset and preparing it for release. Thank you for your interest — we appreciate your patience!
What kind of data will be released? The pretraining and SFT data? @Lechatelia
@roadcode I wonder whether you have a plan to release the additional tool usage dataset that has been newly added on top of the InternVL-2.5 training set for SFT.
I have another question: what exactly is the SFT dataset used in InternVL2.5? And for the other datasets used in InternVL2.5, are the descriptions in this document correct?
In the tech report (p. 6):
Data. For SFT data, we construct the training corpora based on those used in InternVL2.5 [18] while introducing additional tool usage, 3D scene understanding, GUI operations, scientific diagrams, creative writing, and multimodal reasoning samples. As a result, the number of training samples grows from 16.3M in InternVL2.5 to 21.7M in InternVL3.
Thanks in advance.