Latin words inside Arabic text issue
Environment
- Tesseract Version: tesseract 4.1.1 leptonica-1.79.0
- Commit Number: installed through apt install tesseract-ocr
- Platform: Linux DESKTOP-xxxxxxx 5.10.102.1-microsoft-standard-WSL2 (Ubuntu 20.04)
Current Behavior:
Tesseract fails to recognize Latin words inside Arabic paragraphs. In the example below, the word Fipar-Holding is not recognized as Latin.
1. Input image :
2. Command :
tesseract ./input_image.png - -l fra+ara
3. Results :
قرارلمجلس المنافسة عدد 0028/ق/2022 صبادر25 من شعبان 1443 (28 مارس 2022) والمتعلق بتولي الشركة القابضة للمساهمات والاستثمارات «~~11010108-:2م1]~~» للمر اقبة المشتركة على شركة «CMGP Group Sa» وذلك عبراقتناء نسبة14,81 96 من أسيم رأسمالها وحقوق التصويت المرتبطة به.
Expected Behavior:
I expect tesseract to offer a good recognition rate for Latin words inside Arabic text. In my current use case, too many Latin words inside Arabic text are recognized incorrectly, which prevents me from using the solution in production.
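To make "too many incorrect recognitions" measurable, here is a minimal sketch of a check that could be run on the OCR output. The helper names `script_of` and `missing_latin_words` are hypothetical and not part of tesseract; the sample text is a fragment of the result above, and the expected words are the ones I know appear in the image.

```python
import unicodedata


def script_of(token):
    """Classify a token as 'latin', 'arabic', 'mixed' or 'other' from its letters."""
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue  # digits and punctuation carry no script information here
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            scripts.add("latin")
        elif name.startswith("ARABIC"):
            scripts.add("arabic")
        else:
            scripts.add("other")
    if not scripts:
        return "other"
    return scripts.pop() if len(scripts) == 1 else "mixed"


def missing_latin_words(ocr_text, expected_words):
    """Return the expected Latin words that do not appear verbatim in the OCR output."""
    return [word for word in expected_words if word not in ocr_text]


if __name__ == "__main__":
    # Fragment of the OCR result quoted in this report.
    ocr_text = "للمر اقبة المشتركة على شركة «CMGP Group Sa» وذلك"
    print(missing_latin_words(ocr_text, ["Fipar-Holding", "CMGP Group"]))
    print([script_of(tok) for tok in ocr_text.split()])
```

This only counts exact matches, so a single wrong character marks the whole word as missing; a character-level distance would be a gentler metric.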
Suggested Fix:
Here's what I've tried so far, but unfortunately none of these attempts has fixed the issue:
- Tried different language model combinations (ara, Arabic, eng, fra - all from tessdata_best) in all possible orders
- Fine-tuned ara, fra, and eng on the font used in the input image (250 pages each, 4800 iterations)
- Upscaled the image x2, x4 and x8
- Tried some erosion and dilation
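For reference, the language-order sweep from the first bullet can be sketched roughly as follows. This is an illustration rather than the exact script I ran: the helper names `language_orders` and `build_command` are hypothetical, and it assumes each model name can be passed to `-l` as written (a script model such as Arabic may need a different name depending on where its traineddata file is installed).

```python
from itertools import permutations

# Models named in the list above; assumed to be resolvable by `-l` as written.
LANGS = ["ara", "Arabic", "eng", "fra"]


def language_orders(langs, min_size=2):
    """Yield every ordered combination of at least `min_size` models, joined with '+'."""
    for size in range(min_size, len(langs) + 1):
        for combo in permutations(langs, size):
            yield "+".join(combo)


def build_command(image_path, lang_spec):
    """Build the same invocation as in this report: text to stdout, given -l value."""
    return ["tesseract", image_path, "-", "-l", lang_spec]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    for spec in language_orders(LANGS):
        print(" ".join(build_command("./input_image.png", spec)))
```

Each printed line can be run as-is, or passed to `subprocess.run` to collect the outputs for comparison.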
I have run out of ideas. While I keep investigating, any workaround would be highly appreciated!