ERNIE PaddleOCR-VL finetuning dataset format

Hi, I've been reading the instructions on how to finetune the PaddleOCR-VL model and I have some questions about preparing the finetuning dataset: https://github.com/PaddlePaddle/ERNIE/blob/release/v1.4/docs/paddleocr_vl_sft.md

Let's say I have a single-page PDF image with some text, tables, and images (see above).
1. How should I generate the finetuning data in this case? Do I have to separate all the text, tables, and images and create a finetuning dataset for each task?
2. Is it possible to train the PaddleOCR-VL model from scratch using ERNIE?
3. Assuming my finetuning dataset contains only one task (say, Table Recognition), how do you think this will impact the overall model performance?

Thank you so much!

Nov 08 '25 06:11 Theophylline

@Theophylline, based on my experience, you should get crops of the page (based on bboxes and labels) from a layout model such as PP-DocLayoutV2, which is configured here: https://github.com/PaddlePaddle/PaddleX/blob/release/3.3/paddlex/configs/pipelines/PaddleOCR-VL.yaml . If you use the official vLLM Docker image, it is easy to get an instance you can access. BR

Nov 09 '25 01:11 jerrywind

1. You can use a layout detection model, such as PP-DocLayoutV2 mentioned above, to crop out the individual elements and then annotate each region separately. Alternatively, you can run PaddleOCR-VL directly and save the results to a JSON file, which gives you the coordinates and recognition results of each sub-region. Crop the sub-regions out, then correct the labels starting from the recognition results.
2. Yes, training from scratch is possible: you can set from_scratch=1.
3. If you fine-tune on only one specific task, the model may, in theory, experience some forgetting, leading to lower accuracy on the other tasks.
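
For the crop-then-label workflow in the first answer, here is a minimal Python sketch of how detected regions could be turned into per-task samples. Everything in it is an assumption for illustration: the region fields (`label`, `bbox`, `content`), the label names, the prompt strings, and the output record keys are hypothetical, not the actual PaddleOCR-VL JSON or the ERNIE SFT schema, so map them to whatever the linked fine-tuning doc and your layout model really use.

```python
import json
import math

# Hypothetical mapping from layout labels to task prompts (assumed names).
# Labels without an entry (e.g. pictures) produce no training sample.
LABEL_TO_PROMPT = {
    "text": "OCR:",
    "table": "Table Recognition:",
    "formula": "Formula Recognition:",
    "chart": "Chart Recognition:",
}


def clamp_box(bbox, page_width, page_height):
    """Clamp an (x1, y1, x2, y2) box to the page and snap it to whole pixels."""
    x1, y1, x2, y2 = bbox
    left = max(0, min(math.floor(x1), page_width))
    top = max(0, min(math.floor(y1), page_height))
    right = max(0, min(math.ceil(x2), page_width))
    bottom = max(0, min(math.ceil(y2), page_height))
    return [left, top, right, bottom]


def build_samples(regions, page_width, page_height, page_id="page"):
    """Turn layout regions into one (crop, prompt, target) record per element."""
    samples = []
    for index, region in enumerate(regions):
        prompt = LABEL_TO_PROMPT.get(region["label"])
        if prompt is None:
            continue  # no recognition task defined for this label
        box = clamp_box(region["bbox"], page_width, page_height)
        if box[2] <= box[0] or box[3] <= box[1]:
            continue  # skip empty boxes
        samples.append({
            "crop_file": f"{page_id}_{index:03d}.png",  # hypothetical naming
            "crop_box": box,
            "prompt": prompt,
            # Model output to be reviewed and corrected by hand before training.
            "target": region.get("content", ""),
        })
    return samples


# Made-up regions for a 600x800 page; the contents are placeholders.
regions = [
    {"label": "text", "bbox": [40.2, 30.7, 560.0, 120.1], "content": "Quarterly report"},
    {"label": "table", "bbox": [40, 150, 560, 420], "content": "placeholder table markup"},
    {"label": "image", "bbox": [40, 450, 300, 700]},
    {"label": "text", "bbox": [500, 780, 650, 820], "content": "Page 1"},
]
samples = build_samples(regions, page_width=600, page_height=800)
for sample in samples:
    print(json.dumps(sample))
```

Each `crop_box` is in (left, upper, right, lower) order, so the pixel crop itself can be done with an image library, for example Pillow's `Image.crop(box)`. Keeping one record per element means a mixed page yields samples for several tasks, which you can then write out in whatever format the SFT doc requires.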

Nov 10 '25 07:11 Sunting78