Questions about the pre-training data format
Hi, I want to continue pre-training from the first-stage pre-trained model and then run SFT, and I have a few questions:
1. Where are the weights of the first-stage pre-trained model?
2. What is the format of 'path/to/pretrain/data.json'? Could you provide a sample file similar to internvl_1_2_finetune.json?
3. What is the minimum amount of data required for continued pre-training?
Hello, the pre-training weights from the first stage are essentially just the MLP projector, and we will release them shortly. The data format for pre-training is the same as in the SFT stage. For continued pre-training, I recommend using at least 1M samples.
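For reference, here is a minimal sketch of what a single pre-training entry might look like, assuming it follows the same LLaVA-style conversation format as the SFT annotations (the field values, image path, and prompt below are purely illustrative and not confirmed by the maintainers):

```python
import json

# Hypothetical example entry, assuming the SFT-style conversation format
# (id / image / conversations) also applies to continued pre-training.
sample = {
    "id": 0,
    "image": "images/0001.jpg",  # illustrative path, relative to the dataset root
    "conversations": [
        {"from": "human", "value": "<image>\nPlease read the text in the image."},
        {"from": "gpt", "value": "The sign says 'OPEN 9AM-5PM'."},
    ],
}

# Annotations are often stored one JSON object per line; writing a single
# sample here only to show the structure.
with open("path/to/pretrain/data.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```

Please treat this only as a format sketch; the official sample file should be the authoritative reference once released.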
May I ask when these pre-training weights will be released, and whether they will cover the whole series? Also, will the pre-training data, especially the OCR data, be released? Thanks.
Hello, we are planning to release some of the pre-training OCR data, but the dataset is quite large, consisting of tens of millions of entries, so it will take some time to organize the files.