tesseract Latin words inside Arabic text issue

Open naourass opened this issue 1 year ago • 0 comments

Environment

Tesseract Version: tesseract 4.1.1 leptonica-1.79.0
Commit Number: installed through apt install tesseract-ocr
Platform: Linux DESKTOP-xxxxxxx 5.10.102.1-microsoft-standard-WSL2 (Ubuntu 20.04)

Current Behavior:

Tesseract fails to recognize Latin words inside Arabic paragraphs. In the example below, the word Fipar-Holding is not recognized as Latin.

1. Input image:

2. Command: tesseract ./input_image.png - -l fra+ara

3. Results:

قرارلمجلس المنافسة عدد 0028/ق/2022 صبادر25 من شعبان 1443 (28 مارس 2022) والمتعلق بتولي الشركة القابضة للمساهمات والاستثمارات «~~11010108-:2م1]~~» للمر اقبة المشتركة على شركة ‎«CMGP Group Sa»‏ وذلك عبراقتناء نسبة14,81 96 من أسيم رأسمالها وحقوق التصويت المرتبطة به.
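
To quantify how often this happens across many pages, one option is to label every token of the OCR output by script and flag the tokens that should have been Latin. The snippet below is only a rough sketch using Python's standard `unicodedata` module; the helper names `script_of` and `token_scripts` are made up for illustration and are not part of Tesseract.

```python
import unicodedata
from typing import List, Tuple


def script_of(ch: str) -> str:
    """Rough script label for one character, based on its Unicode name."""
    if not ch.isalpha():
        return "other"  # digits, punctuation, bidi marks, etc.
    name = unicodedata.name(ch, "")
    if name.startswith("ARABIC"):
        return "arabic"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def token_scripts(text: str) -> List[Tuple[str, str]]:
    """Label each whitespace-separated token as arabic, latin, mixed, or other."""
    labelled = []
    for token in text.split():
        # Ignore non-letters when deciding which script a token belongs to.
        scripts = {script_of(ch) for ch in token} - {"other"}
        if not scripts:
            label = "other"
        elif len(scripts) == 1:
            label = scripts.pop()
        else:
            label = "mixed"
        labelled.append((token, label))
    return labelled


if __name__ == "__main__":
    for token, label in token_scripts("شركة «CMGP Group Sa» 2022"):
        print(label, token)
```

With this labelling, the expected word Fipar-Holding would come out as latin, while the garbled token in the result above contains an Arabic letter among digits and is labelled arabic, which makes such cases easy to count.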

Expected Behavior:

I expect tesseract to offer a good recognition rate for Latin words inside Arabic text. In my current use case, too many Latin words inside Arabic text are recognized incorrectly, which prevents me from using the solution in production.

Suggested Fix:

Here's what I've tried so far, but unfortunately none of these attempts has fixed the issue:

Tried different language model combinations (ara, Arabic, eng, fra, all from tessdata_best) in all possible orders
Fine-tuned ara, fra, and eng on the font used in the input image (250 pages each, 4800 iterations)
Upscaled the image 2x, 4x and 8x
Tried some erosion and dilation
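
For anyone who wants to reproduce the first attempt, the language combinations can be enumerated roughly as follows. This is a minimal sketch, not the exact script used: `lang_specs` and `build_commands` are hypothetical helper names, and the only Tesseract options assumed are the ones from the command above (`-` for stdout output and `-l` for the language list).

```python
from itertools import permutations
from typing import List


def lang_specs(models: List[str], min_size: int = 2) -> List[str]:
    """Return every ordered combination of models, joined with '+' for -l."""
    specs = []
    for size in range(min_size, len(models) + 1):
        for combo in permutations(models, size):
            specs.append("+".join(combo))
    return specs


def build_commands(image_path: str, models: List[str]) -> List[List[str]]:
    """Build one tesseract argv list per language specification."""
    return [
        ["tesseract", image_path, "-", "-l", spec]
        for spec in lang_specs(models)
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; pass each list to
    # subprocess.run(...) on a machine where tesseract is installed.
    for cmd in build_commands("./input_image.png", ["ara", "fra", "eng"]):
        print(" ".join(cmd))
```

Each argv list can be passed to `subprocess.run` and the outputs compared side by side; the order matters because Tesseract treats the first language in the list as the primary one.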

I have run out of ideas. While I keep investigating, any workaround would be highly appreciated!

Sep 02 '22 17:09 naourass
