Questions about the pre-training data format
Hi, I want to continue pre-training from the first-stage pre-trained model and then run SFT, and I have a few questions:
1. Where are the weights of the first-stage pre-trained model?
2. What is the format of 'path/to/pretrain/data.json'? Could you provide a sample file similar to internvl_1_2_finetune.json?
3. What is the minimum amount of data required for continued pre-training?
Hello, the pre-training weights from the first stage are essentially just the MLP projector, and we will release them shortly. The data format for pre-training is the same as in the SFT stage. For continued pre-training, I recommend using at least 1M samples.
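For reference, here is a minimal sketch of what a single pre-training entry might look like, assuming it follows the same LLaVA-style conversation format as the SFT annotations (the field values, image path, and prompt below are purely illustrative and not confirmed by the maintainers):

```python
import json

# Hypothetical example entry, assuming the SFT-style conversation format
# (id / image / conversations) also applies to continued pre-training.
sample = {
    "id": 0,
    "image": "images/0001.jpg",  # illustrative path, relative to the dataset root
    "conversations": [
        {"from": "human", "value": "<image>\nPlease read the text in the image."},
        {"from": "gpt", "value": "The sign says 'OPEN 9AM-5PM'."},
    ],
}

# Annotations are often stored one JSON object per line; writing a single
# sample here only to show the structure.
with open("path/to/pretrain/data.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```

Please treat this only as a format sketch; the official sample file should be the authoritative reference once released.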
May I ask when these pre-training weights will be released, and whether they will cover the whole series? Also, will the pre-training data, especially the OCR data, be released? Thanks.
Hello, we are planning to release some of the pre-training OCR data, but the dataset is quite large, consisting of tens of millions of entries, so it will take some time to organize the files.