paddleocr-vl finetuning dataset format

Open Theophylline opened this issue 1 month ago • 2 comments

Hi, I've been reading the instructions on how to finetune the paddleocr-VL model and I have some questions regarding how to prepare the finetuning dataset: https://github.com/PaddlePaddle/ERNIE/blob/release/v1.4/docs/paddleocr_vl_sft.md

Image
  1. Let's say I have a single-page PDF image with some text, tables, and images (see above). How should I generate the finetuning data in this case? Do I have to separate all text, tables, and images and create a finetuning dataset for each task?
  2. Is it possible to train the PaddleOCR-VL model from scratch using ERNIE?
  3. Assume that my finetuning dataset only contains one task (say, Table Recognition). How do you think this will impact the overall model performance?

Thank you so much!

Theophylline avatar Nov 08 '25 06:11 Theophylline

@Theophylline, based on my experience, you should get the crops of the images (based on bboxes and labels) from a layout model such as PP-DocLayoutV2, as configured here: https://github.com/PaddlePaddle/PaddleX/blob/release/3.3/paddlex/configs/pipelines/PaddleOCR-VL.yaml. If you use the official vLLM Docker image, it is easy to get an instance to access. BR

jerrywind avatar Nov 09 '25 01:11 jerrywind

  1. You can use a layout detection model to crop out individual elements. As mentioned above, you can use the PP-DocLayoutV2 model for this purpose. Then, you can annotate each area separately. Alternatively, you can directly use PaddleOCR-VL, save the results in a JSON file to obtain the coordinates and recognition results of sub-areas, crop them out, and then adjust the labels based on the recognition results.

  2. You can set from_scratch=1.

  3. Yes, if you fine-tune only one specific task, theoretically, the model might experience some forgetting, leading to a decrease in accuracy for other tasks.

Sunting78 avatar Nov 10 '25 07:11 Sunting78
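
For reference, the cropping step suggested in both replies could be sketched roughly as follows. This is only an illustration under assumptions: the region dictionaries (`label`, `bbox`), the `LABEL_TO_TASK` mapping, the task names, and the `plan_crops` helper are all hypothetical and are not the actual output schema of PP-DocLayoutV2 or PaddleOCR-VL, so they would need adapting to whatever the saved JSON really contains.

```python
from dataclasses import dataclass

# Hypothetical mapping from layout labels to fine-tuning tasks.
# The real label set of the layout model and the task names used by the
# fine-tuning recipe may differ; adjust to the actual JSON output.
LABEL_TO_TASK = {
    "text": "ocr",
    "table": "table_recognition",
}


@dataclass
class Crop:
    task: str
    box: tuple  # (left, upper, right, lower) in integer pixels


def plan_crops(regions, page_width, page_height, pad=2):
    """Turn detected layout regions into per-task crop boxes.

    Each region is assumed to look like {"label": str, "bbox": [x0, y0, x1, y1]}.
    Boxes are padded by `pad` pixels and clamped to the page bounds.
    Regions whose label has no associated task are skipped.
    """
    crops = []
    for region in regions:
        task = LABEL_TO_TASK.get(region["label"])
        if task is None:
            continue  # e.g. pictures, which have no recognition target here
        x0, y0, x1, y1 = region["bbox"]
        left = max(0, int(x0) - pad)
        upper = max(0, int(y0) - pad)
        right = min(page_width, int(x1) + pad)
        lower = min(page_height, int(y1) + pad)
        if right > left and lower > upper:  # drop degenerate boxes
            crops.append(Crop(task, (left, upper, right, lower)))
    return crops


# Made-up regions for a single 596 x 842 page.
regions = [
    {"label": "text", "bbox": [10, 1, 200, 40]},
    {"label": "table", "bbox": [10, 60, 595, 300]},
    {"label": "image", "bbox": [300, 320, 500, 480]},
]
crops = plan_crops(regions, page_width=596, page_height=842)
for crop in crops:
    print(crop.task, crop.box)
```

Each resulting box is in (left, upper, right, lower) order, so it can be passed to an image library's crop call, and each crop then gets its own label (corrected by hand where the recognition result is wrong) as one sample for its task.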