PaddleOCR
Bounding boxes on the same line break down into smaller ones after finetuning
Hi everyone, I have used a custom dataset (forms and documents) to finetune Chinese+English text detection, using the following:
config: ch_PP-OCRv3_det_student.yml
pretrain_model: ./pretrain_models/ch_PP-OCRv3_det_distill_train/student
With the pretrained model, text on the same line of the document is detected as a single bounding box, but quite a lot of text is missed. After finetuning, recall increases, but text on the same line is split into many smaller bounding boxes.
Has anyone experienced the same issue?
With the pretrained model, text on the same line of the document falls in a single bounding box because the training data are labeled at the text-line level. After finetuning, recall increases but text on the same line is split into many smaller bounding boxes. Maybe your training data are labeled at the word level? We suggest unifying the annotation format so that you can take full advantage of the pretrained model.
A tip: before finetuning the model, try adjusting the post-processing parameters, which can often boost performance on forms and documents.
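As a rough illustration of what these post-processing parameters control, here is a minimal sketch, assuming DB-style post-processing where a candidate box is kept only if its mean score reaches a box threshold and is then expanded by an offset of area * unclip_ratio / perimeter. The function names, the sample boxes and the parameter values are hypothetical and simplified to axis-aligned rectangles; check the PostProcess section of your own config for the real names and values. The binarization threshold, which acts on the probability map before boxes are extracted, is not modeled here.

```python
# Illustrative sketch only, not PaddleOCR code: shows how a box score
# threshold and an unclip ratio affect axis-aligned candidate boxes.

def unclip_distance(width, height, unclip_ratio):
    """Offset used to expand a shrunk text region: area * ratio / perimeter."""
    area = width * height
    perimeter = 2 * (width + height)
    return area * unclip_ratio / perimeter


def expand_box(box, unclip_ratio):
    """Expand (x_min, y_min, x_max, y_max) outward by the unclip distance."""
    x_min, y_min, x_max, y_max = box
    d = unclip_distance(x_max - x_min, y_max - y_min, unclip_ratio)
    return (x_min - d, y_min - d, x_max + d, y_max + d)


def keep_box(mean_score, box_thresh):
    """Keep a candidate only if its mean probability reaches box_thresh."""
    return mean_score >= box_thresh


# Hypothetical candidates: (box, mean probability inside the box).
candidates = [((100, 50, 300, 70), 0.82), ((320, 50, 400, 70), 0.55)]

# Compare two assumed settings: (box_thresh, unclip_ratio).
for box_thresh, unclip_ratio in [(0.6, 1.5), (0.5, 2.0)]:
    kept = [
        expand_box(box, unclip_ratio)
        for box, score in candidates
        if keep_box(score, box_thresh)
    ]
    print(box_thresh, unclip_ratio, kept)
```

With the lower box threshold the second, weaker candidate is retained, and the larger unclip ratio makes every kept box bigger. Whether that helps on a given document set has to be checked on your own validation images.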
Yes, I am using a unified annotation format (labelling all text on the same text line as one bbox). Thank you for the suggestion on modifying the post-processing params.
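I am reading through the tips in the documentation (https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/doc/doc_ch/finetune.md):
- The pretrained models provided by PP-OCR generalize well.
- Adding a small amount of real data (>= 500 images for detection, >= 5000 images for recognition) greatly improves detection and recognition in domain-specific scenarios.
- When finetuning, adding real general-scene data can further improve model accuracy and generalization.
- For text detection, increasing the image prediction scale can further improve detection of small text regions.
- When finetuning, the hyperparameters need to be adjusted appropriately (learning rate and batch size are the most important) to get better finetuning results.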
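The tip about increasing the prediction scale can be sketched as follows. This is a minimal sketch, assuming the detector resizes the input so that one side respects a limit and then rounds each side to a multiple of 32; the function name `resized_shape` and the limit values 960 and 1920 are illustrative assumptions, not PaddleOCR API.

```python
# Illustrative sketch only: how a side-length limit changes the resolution
# the detector sees. Larger limits keep small text larger after resizing.

def resized_shape(height, width, limit_side_len=960, limit_type="max"):
    """Return (new_height, new_width) after limiting one side and
    rounding both sides to a multiple of 32 (minimum 32)."""
    if limit_type == "max":
        # Shrink only when the longer side exceeds the limit.
        longer = max(height, width)
        ratio = limit_side_len / longer if longer > limit_side_len else 1.0
    elif limit_type == "min":
        # Enlarge only when the shorter side is below the limit.
        shorter = min(height, width)
        ratio = limit_side_len / shorter if shorter < limit_side_len else 1.0
    else:
        raise ValueError("limit_type must be 'max' or 'min'")

    new_height = max(int(round(height * ratio / 32) * 32), 32)
    new_width = max(int(round(width * ratio / 32) * 32), 32)
    return new_height, new_width


# A hypothetical 2000 x 1000 scanned form under two assumed limits.
for limit in (960, 1920):
    print(limit, resized_shape(2000, 1000, limit_side_len=limit))
```

Under these assumptions, raising the limit roughly doubles the resolution at which a large scan is processed, at the cost of more memory and slower inference.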
For point 2, does 真实数据 (real data) refer to scene text data such as that from the ICDAR 2015 challenge?
- When finetuning, adding real general-scene data can further improve model accuracy and generalization.
真实数据 (real data) means data collected from real scenes, as opposed to synthesized data.