InternVL icon indicating copy to clipboard operation
InternVL copied to clipboard

Questions about the pre-training data format

Open sky-fly97 opened this issue 1 year ago • 5 comments

Hi, I want to continue pre-training based on the pre-trained model in the first stage, and then SFT,and I have some questions. 1, where is the weight of the pre-trained model in the first stage? 2, what is the format of 'path/to/pretrain/data.json', can you give a sample file similar to internvl_1_2_finetune.json? 3, To continue pre-training, what is the minimum size of data required?

sky-fly97 avatar Jul 02 '24 04:07 sky-fly97

mark mark

HarrytheOrange avatar Jul 13 '24 11:07 HarrytheOrange

mark

gujiacheng avatar Jul 31 '24 10:07 gujiacheng

Hello, the pre-training weights from the first stage are essentially the MLP projector, and we will release them shortly. Additionally, the data format for our pre-training is consistent with the SFT stage. For continued pre-training, I recommend using at least 1M or more data.

czczup avatar Aug 26 '24 04:08 czczup

May I ask when this pre-train weight will be released, is it for all series? And will the pre-training data especially OCR data be released ? thanks.

nemonameless avatar Aug 27 '24 08:08 nemonameless

Hello, we are planning to release some pre-trained OCR data, but the dataset is quite large, consisting of tens of millions of entries, so it will take some time to organize the files.

czczup avatar Sep 07 '24 08:09 czczup