paddleocr-vl finetuning dataset format
Hi, I've been reading the instructions on how to finetune the PaddleOCR-VL model, and I have some questions about how to prepare the finetuning dataset: https://github.com/PaddlePaddle/ERNIE/blob/release/v1.4/docs/paddleocr_vl_sft.md
- Let's say I have a single-page PDF image with some text, a table, and images (see above). How should I generate the finetuning data in this case? Do I have to separate all the text, tables, and images and create a finetuning dataset for each task?
- Is it possible to train the PaddleOCR-VL model from scratch using ERNIE?
- Assume my finetuning dataset contains only one task (say, Table Recognition); how do you think this will impact overall model performance?
Thank you so much!
@Theophylline, based on my experience, you should get the image crops (based on bboxes and labels) from a layout model such as PP-DocLayoutV2; see https://github.com/PaddlePaddle/PaddleX/blob/release/3.3/paddlex/configs/pipelines/PaddleOCR-VL.yaml. If you use the official vLLM Docker image, it is easy to get an instance up and accessible. BR
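A minimal sketch of that cropping step, assuming the layout-detection output is already available as a list of label/bbox records — the exact output schema of PP-DocLayoutV2 depends on your PaddleX version, so treat the field names here as illustrative:

```python
import os
from PIL import Image

def crop_layout_regions(page_image, detections, out_dir="crops"):
    """Crop each detected region (text / table / image ...) into its own
    file so it can be annotated for the matching SFT task.

    `detections` is assumed to look like
    [{"label": "table", "bbox": [x1, y1, x2, y2]}, ...]; adapt the field
    names to whatever your layout model actually returns.
    """
    os.makedirs(out_dir, exist_ok=True)
    page = Image.open(page_image).convert("RGB")
    crops = []
    for i, det in enumerate(detections):
        x1, y1, x2, y2 = (int(v) for v in det["bbox"])
        path = os.path.join(out_dir, f"{det['label']}_{i:03d}.png")
        page.crop((x1, y1, x2, y2)).save(path)
        crops.append({"path": path, "label": det["label"]})
    return crops
```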
- You can use a layout detection model to crop out the individual elements. As mentioned above, the PP-DocLayoutV2 model works for this purpose. Then you can annotate each cropped area separately. Alternatively, you can run PaddleOCR-VL directly, save the results to a JSON file to obtain the coordinates and recognition results of the sub-areas, crop those out, and then correct the labels based on the recognition results (see the serialization sketch after this list).
- You can set `from_scratch=1` (a config sketch follows below).
- Yes, if you fine-tune on only one specific task, the model might theoretically experience some forgetting, leading to a decrease in accuracy on the other tasks (a simple data-mixing mitigation is sketched at the end).
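Following up on the first answer: once you have cropped regions with corrected labels (e.g., the records returned by the cropping sketch earlier in the thread, plus a ground-truth `text` field), you still need to serialize them into the SFT dataset format. The `image`/`prompt`/`response` JSONL layout and the prompt strings below are assumptions for illustration — match them to the format actually specified in paddleocr_vl_sft.md.

```python
import json

# Hypothetical per-task prompts; align these with the prompts the SFT doc uses.
PROMPTS = {"text": "OCR:", "table": "Table Recognition:", "image": "Image Description:"}

def write_sft_jsonl(crops, out_jsonl):
    """`crops` is a list of {"path": ..., "label": ..., "text": ...} records,
    where "text" is the manually reviewed ground-truth transcription."""
    with open(out_jsonl, "w") as out:
        for c in crops:
            record = {
                "image": c["path"],
                "prompt": PROMPTS.get(c["label"], "OCR:"),
                "response": c["text"],
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
```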
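On training from scratch: the answer above only confirms the `from_scratch=1` switch, not where it lives. A hypothetical sketch of an SFT config fragment — every key except `from_scratch` is a placeholder, so check the linked doc for the real file and key names:

```yaml
# Hypothetical SFT config fragment. Only `from_scratch` comes from the
# answer above; the other keys are placeholders for context.
model_name_or_path: PaddlePaddle/PaddleOCR-VL   # placeholder
train_dataset_path: ./sft_table_data.jsonl      # placeholder
from_scratch: 1   # randomly initialize weights instead of loading the released checkpoint
```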
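On the single-task forgetting point: a common mitigation is to mix a small share of samples from the other tasks back into the table-recognition set (replay). A minimal sketch assuming JSONL sample files; the 10% replay ratio is only an illustrative starting point, not an official recommendation:

```python
import json
import random

def mix_replay(task_jsonl, replay_jsonl, out_jsonl, replay_ratio=0.1, seed=0):
    """Blend a fraction of other-task samples into a single-task SFT set
    to reduce forgetting on the tasks you are not fine-tuning."""
    with open(task_jsonl) as f:
        task = [json.loads(line) for line in f]
    with open(replay_jsonl) as f:
        replay = [json.loads(line) for line in f]
    rng = random.Random(seed)
    k = min(len(replay), int(len(task) * replay_ratio))
    mixed = task + rng.sample(replay, k)
    rng.shuffle(mixed)
    with open(out_jsonl, "w") as out:
        for s in mixed:
            out.write(json.dumps(s, ensure_ascii=False) + "\n")
```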