Latin words inside Arabic text issue

Open naourass opened this issue 1 year ago • 0 comments

Environment

  • Tesseract Version: tesseract 4.1.1 leptonica-1.79.0
  • Commit Number: installed through apt install tesseract-ocr
  • Platform: Linux DESKTOP-xxxxxxx 5.10.102.1-microsoft-standard-WSL2 (Ubuntu 20.04)

Current Behavior:

Tesseract fails to recognize Latin words inside Arabic paragraphs. In the example below, the word Fipar-Holding is not recognized as Latin.

1. Input image : input_image.png

2. Command : tesseract ./input_image.png - -l fra+ara

3. Results :

قرارلمجلس المنافسة عدد 0028/ق/2022 صبادر25 من شعبان 1443 (28 مارس 2022) والمتعلق بتولي الشركة القابضة للمساهمات والاستثمارات «~~11010108-:2م1]~~» للمر اقبة المشتركة على شركة ‎«CMGP Group Sa»‏ وذلك عبراقتناء نسبة14,81 96 من أسيم رأسمالها وحقوق التصويت المرتبطة به.

Expected Behavior:

I expect Tesseract to recognize Latin words inside Arabic text reliably. In my current use case, too many Latin words inside Arabic text are misrecognized, which prevents me from using the solution in production.
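
To make "recognition rate" measurable, one option is to check which hand-listed Latin strings survive in the OCR output. The following is a minimal sketch under stated assumptions, not part of Tesseract: the helper names `strip_format_chars` and `latin_recall` are hypothetical, the expected strings are typed in manually, and matching is a plain exact substring test after removing invisible directional marks.

```python
import unicodedata
from typing import List, Tuple


def strip_format_chars(text: str) -> str:
    """Drop invisible format characters (Unicode category Cf), such as LRM/RLM marks."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def latin_recall(ocr_text: str, expected_latin: List[str]) -> Tuple[float, List[str]]:
    """Return (fraction of expected Latin strings found, list of the missing ones).

    Assumption: an expected string counts as recognized only if it appears
    verbatim in the cleaned OCR output.
    """
    cleaned = strip_format_chars(ocr_text)
    missing = [word for word in expected_latin if word not in cleaned]
    found = len(expected_latin) - len(missing)
    rate = found / len(expected_latin) if expected_latin else 1.0
    return rate, missing


if __name__ == "__main__":
    # Short excerpt of the result above; the expected list is written by hand.
    sample = "على شركة \u200e«CMGP Group Sa»\u200f وذلك"
    rate, missing = latin_recall(sample, ["Fipar-Holding", "CMGP Group Sa"])
    print(rate, missing)
```

Running the same check over the output of each configuration gives a simple number to compare instead of eyeballing the text.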

Suggested Fix:

Here's what I've tried so far, but unfortunately none of these attempts has fixed the issue:

  • Tried different language model combinations (ara, Arabic, eng, fra - all from tessdata_best) in all possible orders
  • Fine-tuned ara, fra, and eng on the font used in the input image (250 pages each, 4800 iterations)
  • Upscaled the image x2, x4 and x8
  • Tried some erosion and dilation
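
For anyone reproducing the language-order sweep from the first bullet, it could be scripted roughly as follows. This is a sketch, not the exact script used for this report: the helper names `language_orders`, `build_command`, and `run_all` are hypothetical, the model names are taken as written above, and `--tessdata-dir` is only passed when a directory is supplied (for example a local copy of tessdata_best).

```python
import itertools
import subprocess
from typing import Dict, List, Optional


def language_orders(models: List[str]) -> List[str]:
    """Return every ordered selection of the models, joined with '+' for -l."""
    orders = []
    for size in range(1, len(models) + 1):
        for combo in itertools.permutations(models, size):
            orders.append("+".join(combo))
    return orders


def build_command(image_path: str, langs: str,
                  tessdata_dir: Optional[str] = None) -> List[str]:
    """Build the tesseract call used in this report ('-' writes text to stdout)."""
    cmd = ["tesseract", image_path, "-", "-l", langs]
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_all(image_path: str, models: List[str],
            tessdata_dir: Optional[str] = None) -> Dict[str, str]:
    """Run tesseract once per language order and collect the recognized text."""
    results = {}
    for langs in language_orders(models):
        proc = subprocess.run(
            build_command(image_path, langs, tessdata_dir),
            capture_output=True, text=True, encoding="utf-8", check=False,
        )
        results[langs] = proc.stdout
    return results
```

Calling `run_all("./input_image.png", ["ara", "Arabic", "eng", "fra"])` would run 64 configurations, since every non-empty ordered selection of the four models is tried.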

I have run out of ideas. While I keep investigating, any workaround would be highly appreciated!

naourass, Sep 02 '22 17:09