PaddleOCR Bug when doing CTC/SAR MultiLabel encode with Arabic

Open • Hegelim opened this issue 1 year ago • 1 comment

Problem

First off, could someone please help reopen issue #10806? I accidentally closed it, then the bot closed it, and there seems to be no way for me to reopen it myself. This issue is directly related to the problem I described in #10806.

Python iterates over a string in its stored (logical) order. For Arabic text that is the reading order, so a character-by-character for loop visits the characters from right to left as they appear on screen. If you want to train an Arabic recognition model and your ground-truth labels are written left to right, as an English speaker would write them, you need to be very careful when looping over the characters. Whenever "MultiLabelEncode" appears in the config.yaml file used for training, the code at https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.7/ppocr/data/imaug/label_ops.py#L153 runs into this problem.
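To make the ordering concrete, here is a minimal sketch that uses only the standard library. The sample word and variable names are illustrative assumptions, not taken from PaddleOCR or any dataset. It only shows that iteration yields characters in stored order, where the first character is the one rendered rightmost.

```python
import unicodedata

# Illustrative sample (an assumption): an Arabic word for "hello",
# written with escapes so the stored order is unambiguous in the source.
text = "\u0645\u0631\u062d\u0628\u0627"

# Iterating a str follows its stored (logical) order.
chars = list(text)
names = [unicodedata.name(ch) for ch in chars]

# Bidirectional class of each character; Arabic letters report "AL".
bidi_classes = [unicodedata.bidirectional(ch) for ch in chars]

for ch, name, cls in zip(chars, names, bidi_classes):
    print(f"U+{ord(ch):04X} {cls} {name}")
```

The first line printed is the letter MEEM, which a renderer places at the right edge of the word, so a plain for loop effectively walks the displayed text from right to left.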

Fix

One way to fix this is to use the python-bidi package:

from bidi.algorithm import get_display  # provided by the python-bidi package

for char in get_display(text, base_dir="L"):
    ...  # existing per-character encoding goes here

The base_dir="L" argument is essential here: it forces a left-to-right base direction, so the loop visits the characters in left-to-right order.
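As a rough sketch of where such a reordering step could sit in a label encoder: `encode_label`, `char_to_idx`, and the `reorder` hook below are hypothetical names for illustration, not PaddleOCR's actual API. The slice-reversal lambda is a dependency-free stand-in for get_display, and is only reasonable for strings made up entirely of right-to-left letters.

```python
from typing import Callable, Dict, List, Optional


def encode_label(
    text: str,
    char_to_idx: Dict[str, int],
    reorder: Optional[Callable[[str], str]] = None,
) -> List[int]:
    """Map characters to dictionary indices (hypothetical helper).

    If `reorder` is given, it is applied to the whole string before the
    per-character lookup. Characters missing from the dictionary are skipped.
    """
    if reorder is not None:
        text = reorder(text)
    return [char_to_idx[ch] for ch in text if ch in char_to_idx]


# Toy dictionary and label (assumptions for the demo only).
char_to_idx = {"\u0645": 0, "\u0631": 1, "\u062d": 2}
label = "\u0645\u0631\u062d"

# Plain iteration: indices follow the stored (logical) order.
logical = encode_label(label, char_to_idx)

# Stand-in reordering: reverse the string before encoding.
visual = encode_label(label, char_to_idx, reorder=lambda s: s[::-1])

print(logical, visual)
```

With python-bidi installed, the hook would instead be reorder=lambda s: get_display(s, base_dir="L"). Keeping the reordering as a single pluggable step makes it easy to check both orders against the training labels before committing to one.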

System Environment: Ubuntu 18.04
Version: Paddle: 2.7

Sep 06 '23 22:09 Hegelim

Any update on this and whether the fix is correct @Hegelim?

Struggling to get sensible recognition from Paddle for Arabic.

May 07 '24 08:05 connorourke
