PaddleOCR Low per char confidence score on customized dataset when the text is clear.

Please provide the following information to quickly locate the problem

System Environment: Linux
Version: Paddle: (blank), PaddleOCR: V4
Related components: recognition
Command Code:
Complete Error Message:

Please try not to include images in the issue.

Mar 27 '24 19:03 OwenHuaKargo

Do you mean that individual characters score lower than most other characters?

Apr 29 '24 01:04 UserWangZz

@UserWangZz Yes

May 07 '24 21:05 OwenHuaKargo

@UserWangZz Yes

This is normal. The model always has some uncertainty. The cause may be characters that are difficult to distinguish, or padding at the beginning and end of the characters.

May 08 '24 01:05 UserWangZz

Thank you for your reply. What do you mean by the padding? Why does the uncertainty happen for clear characters?

May 08 '24 01:05 OwenHua666

Padding refers to the blank area at the beginning and end of the character. Different padding sizes will also affect the accuracy of recognition.

Regarding the second question, I cannot say for certain. It may be the angle of the characters, the barcode above them, etc. Many factors affect the model's judgment of characters.

Did you segment the characters and calculate the confidence of each character?

May 08 '24 02:05 UserWangZz

Yes. I segmented to get the per character confidence.

May 08 '24 02:05 OwenHua666
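For readers following along, here is a minimal sketch of one common way to obtain per-character confidence from a CTC-style recognition head: greedy decoding that keeps the probability of each emitted character. It is not necessarily how the poster segmented the characters. The function name `ctc_greedy_decode`, the blank index 0, and the toy probabilities are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

BLANK = 0  # assumption: class index 0 is the CTC blank


def ctc_greedy_decode(
    probs: Sequence[Sequence[float]], charset: Sequence[str]
) -> List[Tuple[str, float]]:
    """Greedy CTC decode that returns (character, confidence) pairs.

    probs[t][k] is the probability of class k at time step t.
    Class k > 0 maps to charset[k - 1] (hypothetical layout).
    """
    result: List[Tuple[str, float]] = []
    prev = BLANK
    for step in probs:
        # Pick the most likely class at this time step.
        idx = max(range(len(step)), key=step.__getitem__)
        # Emit only when the class is not blank and not a repeat of the
        # previous time step; the confidence is the probability at the
        # first time step of the run.
        if idx != BLANK and idx != prev:
            result.append((charset[idx - 1], step[idx]))
        prev = idx
    return result


# Toy example: 5 time steps, classes = [blank, "8", "A"].
toy_probs = [
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.3, 0.6],
]
decoded = ctc_greedy_decode(toy_probs, "8A")
```

In this view a character's confidence depends on a single time step, so anything that shifts probability mass at that step (neighboring clutter, padding, slant) can lower the score even when the final text is correct.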

So for this picture, it is very likely that the barcode above the first 8 affects the model's recognition, because there is no barcode above the second 8.

May 08 '24 06:05 UserWangZz

Two more follow-up questions. 1. How do we improve the confidence scores of clear characters that currently score low? 2. If we have leading or trailing padded space, should I add extra spaces in the labeling step for my customized dataset? What if there is more than one space between two words in a single text strip in my dataset? E.g. "Name: Owen". Should I put more than one space in between in the label?

May 08 '24 14:05 OwenHuaKargo

Are the confidence scores very important in your project? What I mean is that although the confidence score is low, the model did not make any recognition errors. And when the image is relatively clear, recognition errors are generally rare.

May 09 '24 02:05 UserWangZz

Yes, they are extremely important. There are cases in which we need to filter out the low-conf chars when the digits are not clear.

May 09 '24 22:05 OwenHuaKargo
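The filtering use case described above can be sketched as follows. This is an illustration only: the function name `split_by_confidence` and the threshold of 0.9 are hypothetical, and as the thread notes, a clear and correctly recognized character can still score low, so a fixed threshold may discard correct characters.

```python
from typing import List, Tuple


def split_by_confidence(
    chars: List[Tuple[str, float]], threshold: float = 0.9
) -> Tuple[List[Tuple[str, float]], List[Tuple[int, str, float]]]:
    """Split (character, confidence) pairs around a threshold.

    Returns the pairs at or above the threshold, plus the
    (position, character, confidence) of each pair below it so the
    low-confidence positions can be reviewed or rejected.
    """
    kept = [(c, p) for c, p in chars if p >= threshold]
    flagged = [(i, c, p) for i, (c, p) in enumerate(chars) if p < threshold]
    return kept, flagged


# Hypothetical per-character scores for the string "1808".
scores = [("1", 0.99), ("8", 0.62), ("0", 0.97), ("8", 0.98)]
kept, flagged = split_by_confidence(scores)
```

Keeping the position of each flagged character makes it possible to reject the whole string, or to re-check only the doubtful positions, instead of silently dropping characters.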

Can you also give some suggestions for the multiple-spaces scenarios? Thx

May 09 '24 22:05 OwenHuaKargo

About this picture: can you crop the first 8 and treat it like the second 8? Just remove the barcode above the digit, run inference on this 8, and check the confidence score. That will let us check whether the barcode above the number affects the model's recognition.

May 10 '24 01:05 UserWangZz
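The experiment suggested above (blank out the barcode, then re-run recognition on the crop) could be prepared along these lines. This is a sketch under stated assumptions: the image is a grayscale crop stored as a list of pixel rows, `blank_rows_above` is a hypothetical helper, and the row at which the barcode ends has to be chosen by hand. The recognition call itself is left out.

```python
from typing import List


def blank_rows_above(
    image: List[List[int]], cut_row: int, fill: int = 255
) -> List[List[int]]:
    """Return a copy of `image` with rows [0, cut_row) set to `fill`.

    `image` is a grayscale crop as a list of pixel rows; `fill` = 255
    paints the removed region white (assumes a light background).
    The input image is not modified.
    """
    return [
        [fill] * len(row) if r < cut_row else list(row)
        for r, row in enumerate(image)
    ]


# Toy 3x2 crop: pretend the first row holds barcode pixels.
crop = [[0, 10], [20, 30], [40, 50]]
masked = blank_rows_above(crop, cut_row=1)
```

Comparing the confidence of the digit before and after masking shows whether the region above it influences the score.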

Do we have a more general fix for the above problem? The first image doesn't have a barcode.

May 15 '24 18:05 OwenHuaKargo

Would you be able to help with the multiple-spaces scenario? Eg. "Owen Hua" in a single quad polygon. How many spaces should I label for fine-tuning the recognition model?

May 15 '24 18:05 OwenHuaKargo

Would you be able to help with the multiple-spaces scenario? Eg. "Owen Hua" in a single quad polygon. How many spaces should I label for fine-tuning the recognition model?

Sorry, I didn't understand your question. Can you explain it in more detail?

May 16 '24 01:05 UserWangZz

There is more than one space between my first and last name. How should I label it?

May 16 '24 02:05 OwenHua666

You can set use_space_char: true so that the model predicts the space character in the recognition stage.

May 16 '24 08:05 UserWangZz
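As a sketch of the labeling side: PaddleOCR recognition label files put the image path and the text on one line separated by a tab, so spaces inside the text can be written as they appear. The helper names and the example path below are hypothetical. Whether the training pipeline preserves runs of several spaces, and how many spaces should be labeled, is not confirmed in this thread; `use_space_char` (usually set under `Global` in the recognition config) only enables the space character in the dictionary.

```python
from typing import Tuple


def make_rec_label_line(image_path: str, text: str) -> str:
    """Build one 'path<TAB>text' label line, keeping spaces in `text`."""
    if "\t" in text or "\n" in text:
        # A tab or newline inside the text would break the line format.
        raise ValueError("label text must not contain tabs or newlines")
    return f"{image_path}\t{text}\n"


def parse_rec_label_line(line: str) -> Tuple[str, str]:
    """Split a label line back into (path, text).

    Only the trailing newline is stripped, so leading, trailing, and
    repeated spaces in the text survive the round trip.
    """
    path, text = line.rstrip("\n").split("\t", 1)
    return path, text


# Hypothetical crop whose text has three spaces between the two words.
line = make_rec_label_line("crops/name_001.jpg", "Owen   Hua")
path, text = parse_rec_label_line(line)
```

Note that calling `strip()` on the whole line when reading labels would silently drop leading and trailing spaces, which matters if padded space is labeled on purpose.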
