PaddleOCR Bounding box in the same line break down into smaller ones after finetuning

Open kenho211 opened this issue 2 years ago • 4 comments

Hi everyone, I have used a custom dataset (forms and documents) to fine-tune Chinese+English text detection, using the following:

config: ch_PP-OCRv3_det_student.yml
pretrain_model: ./pretrain_models/ch_PP-OCRv3_det_distill_train/student

With the pretrained model, text on the same line of a document is detected as a single bounding box, but quite a lot of text is missed. After fine-tuning, recall increases, but text on the same line is split into many smaller bounding boxes.

Has anyone experienced the same issue?

Oct 31 '22 22:10 kenho211

With the pretrained model, text on the same line of a document is detected as a single bounding box because the training data is labeled at the text-line level. After fine-tuning, recall increases but text on the same line is split into many smaller bounding boxes; maybe your training data is labeled at the word level? We suggest keeping the annotation format unified to take full advantage of the pretrained model.

A tip: before fine-tuning the model, you can try adjusting the post-processing parameters, which can often improve performance on forms and documents.
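To make the post-processing tip concrete, here is a minimal sketch (not PaddleOCR code) of why the binarization threshold matters for fragmented boxes. DB-style post-processing thresholds the predicted probability map and turns each connected region into a box; in PaddleOCR the threshold is exposed as `thresh` under `PostProcess` in the config, or `det_db_thresh` at inference time, alongside `det_db_box_thresh` and `det_db_unclip_ratio`. The 1-D probability row and the helper `count_segments` below are hypothetical, chosen only to illustrate the effect.

```python
def count_segments(prob_row, thresh):
    """Count runs of consecutive values strictly above `thresh`.

    Each run stands in for one connected text region (one box)
    along a single row of a detection probability map.
    """
    segments = 0
    inside = False
    for p in prob_row:
        if p > thresh:
            if not inside:
                segments += 1
                inside = True
        else:
            inside = False
    return segments


# Hypothetical probabilities along one text line: two words with a
# weak response (0.25, 0.28) in the gap between them.
row = [0.1, 0.8, 0.9, 0.85, 0.25, 0.28, 0.9, 0.8, 0.1]

for thresh in (0.3, 0.2):
    print(thresh, count_segments(row, thresh))
```

With the higher threshold the weak gap splits the line into two regions; with the lower one it stays a single region. Real maps are 2-D and a lower threshold can also merge neighbouring lines, so any change should be checked on a validation set.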

Nov 01 '22 07:11 MissPenguin

Yes, I am using a unified annotation format (labelling all text in the same text line as one bbox). Thank you for the suggestion on modifying the post-processing params.

Nov 01 '22 08:11 kenho211

I am reading through the tips in documentation (https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/doc/doc_ch/finetune.md)

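The pretrained models provided by PP-OCR generalize well.
Adding a small amount of real data (detection: >= 500 images; recognition: >= 5000 images) greatly improves detection and recognition in domain-specific scenarios.
When fine-tuning the model, adding real general-scene data can further improve accuracy and generalization.
For text detection, increasing the prediction scale of the image can further improve detection of small text regions.
When fine-tuning the model, hyperparameters need to be adjusted appropriately (learning rate and batch size are the most important) to get better fine-tuning results.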
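For the last tip, a common way to adjust the learning rate is to scale it linearly with the total batch size. The sketch below assumes the shipped config was tuned for 8 GPUs with a per-GPU batch size of 8 and a learning rate of 0.001; these base values are assumptions here and should be checked against your own config file. The helper name `scale_learning_rate` is hypothetical.

```python
def scale_learning_rate(base_lr, base_total_batch, new_total_batch):
    """Linearly rescale a learning rate for a different total batch size."""
    return base_lr * new_total_batch / base_total_batch


# Assumed reference setup (verify against your config): 8 GPUs x 8 images.
BASE_LR = 0.001
BASE_TOTAL_BATCH = 8 * 8

# Example: fine-tuning on a single GPU with batch size 8.
single_gpu_lr = scale_learning_rate(BASE_LR, BASE_TOTAL_BATCH, 8)
print(single_gpu_lr)
```

Linear scaling is only a starting point; it is worth trying a few values around the result.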

For point 2, does 真实数据 ("real data") refer to scene text data such as that from the ICDAR 2015 challenge?

Nov 01 '22 19:11 kenho211

When fine-tuning the model, adding real general-scene data can further improve accuracy and generalization.

真实数据 ("real data") means data collected from real scenes, as opposed to synthesized data.

Nov 10 '22 02:11 MissPenguin
