
text2image - RTL - Null box at index 0

Open Shreeshrii opened this issue 5 years ago • 10 comments

 tesseract -v
tesseract 5.0.0-alpha-410-g6a95
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.3.0

Running text2image on single-line texts helped isolate some lines that cause errors during processing. Hopefully this will help identify the bugs.

The text file and the image are included in a zip file. The box file is not created.
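
The single-line isolation described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper names (`split_into_line_files`, `run_text2image`, `find_failing_lines`) are hypothetical, the text2image flags are copied from the command in this report, and a line is treated as failing when `<outputbase>.box` is missing or empty after the run.

```python
import subprocess
from pathlib import Path


def split_into_line_files(text_path, out_dir):
    """Write each non-empty line of text_path to its own file; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Split on "\n" only, so that unusual Unicode separators stay inside a line.
    lines = Path(text_path).read_text(encoding="utf-8").split("\n")
    paths = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        path = out_dir / f"line{number:05d}.training_text"
        path.write_text(line + "\n", encoding="utf-8")
        paths.append(path)
    return paths


def run_text2image(text_file, outputbase, font, fonts_dir):
    """Invoke text2image with a subset of the flags used in this report."""
    subprocess.run(
        [
            "text2image", "--strip_unrenderable_words",
            "--xsize=3000", "--ysize=152", "--leading=32", "--margin=12",
            "--max_pages=0",
            f"--fonts_dir={fonts_dir}", f"--font={font}",
            f"--text={text_file}", f"--outputbase={outputbase}",
        ],
        check=False,  # keep going; failure is judged by the missing box file
    )


def find_failing_lines(text_path, out_dir, font, fonts_dir, render=run_text2image):
    """Return the single-line files for which no usable box file was produced."""
    failing = []
    for text_file in split_into_line_files(text_path, out_dir):
        outputbase = text_file.with_suffix("")
        render(text_file, outputbase, font, fonts_dir)
        box = Path(f"{outputbase}.box")
        if not box.is_file() or box.stat().st_size == 0:
            failing.append(text_file)
    return failing
```

The `render` parameter is injectable so the loop can be exercised without text2image installed; in real use the default would be left in place.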

ubuntu@tesseract-ocr:~/Sorani$ f=rtlorder.errors
ubuntu@tesseract-ocr:~/Sorani$ fontname="Unikurd Web"
ubuntu@tesseract-ocr:~/Sorani$ OMP_THREAD_LIMIT=1   text2image  --strip_unrenderable_words --xsize=3000 --ysize=152  --leading=32 --margin=12  --char_spacing=0.0 --exposure=0  --max_pages=0  --fonts_dir=/home/ubuntu/.fonts --font="$fontname" --text="$f".training_text  --outputbase="$f"
Rendered page 0 to file rtlorder.errors.tif
Null box at index 0
Error: Call PrepareToWrite before WriteTesseractBoxFile!!

rtlorder errors

‪ ٢١‬تشرینی یەکەمی ‪ ١٩٩٣‬لە ماڵی محەمەد‬ ‫حەمۆدا کۆ

In this particular case, the line seems to have two phrases that are getting swapped.

rtlorder.errors.zip

Shreeshrii avatar Sep 17 '19 12:09 Shreeshrii

I just copied and pasted the sentence, and it looks okay.

‪ ٢١‬تشرینی یەکەمی ‪ ١٩٩٣‬لە ماڵی محەمەد‬ ‫حەمۆدا کۆ

amitdo avatar Oct 13 '19 04:10 amitdo

Please try running text2image with that single line as the text and see whether a box file is created. I get an error.

I had put the text in a div with direction RTL.

Shreeshrii avatar Oct 13 '19 04:10 Shreeshrii

Which font did you use? It seems to be written in Kurdish.

amitdo avatar Oct 13 '19 05:10 amitdo

OK, it's "Unikurd Web".

amitdo avatar Oct 13 '19 05:10 amitdo

I tried with the Arial font as well and got the same error.

ubuntu@tesseract-ocr:~/Sorani$ tesseract -v
tesseract 5.0.0-alpha-473-g6d171
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.3.0
ubuntu@tesseract-ocr:~/Sorani$ text2image -v
Using CAIRO_FONT_TYPE_FT.
5.0.0-alpha-473-g6d171
ubuntu@tesseract-ocr:~/Sorani$
ubuntu@tesseract-ocr:~/Sorani$ f=rtlorder.errors
ubuntu@tesseract-ocr:~/Sorani$ fontname="Unikurd Web"
ubuntu@tesseract-ocr:~/Sorani$ OMP_THREAD_LIMIT=1   text2image  --strip_unrenderable_words --xsize=3000 --ysize=152  --leading=32 --margin=12  --char_spacing=0.0 --exposure=0  --max_pages=0  --fonts_dir=/home/ubuntu/.fonts --font="$fontname" --text="$f".training_text  --outputbase="$f"
Rendered page 0 to file rtlorder.errors.tif
Null box at index 0
Error: Call PrepareToWrite before WriteTesseractBoxFile!!
ubuntu@tesseract-ocr:~/Sorani$
ubuntu@tesseract-ocr:~/Sorani$ f=rtlorder.errors
ubuntu@tesseract-ocr:~/Sorani$ fontname="Arial"
ubuntu@tesseract-ocr:~/Sorani$ OMP_THREAD_LIMIT=1   text2image  --strip_unrenderable_words --xsize=3000 --ysize=152  --leading=32 --margin=12  --char_spacing=0.0 --exposure=0  --max_pages=0  --fonts_dir=/home/ubuntu/.fonts --font="$fontname" --text="$f".training_text  --outputbase="$f"."$fontname"
Rendered page 0 to file rtlorder.errors.Arial.tif
Null box at index 0
Error: Call PrepareToWrite before WriteTesseractBoxFile!!

One interesting thing about the generated image is that the text on it is left-aligned, whereas images for RTL languages usually have the text on the right side of the page.

Shreeshrii avatar Oct 13 '19 08:10 Shreeshrii

There is some problem when numbers or punctuation are at the beginning or end of a line.

Another example:

Iteration 949: GROUND  TRUTH : ،!!!...ێبهن یهوینهل ،شهیێپ .یهکب هویزهباد یکێزێه یكێتشهگ دنهبزیڕ کێسوونشهڕ هکێن
Iteration 949: BEST OCR TEXT : ل!!!...ێبهن یهوینهل شهیێپ .یهکب هویزهباد یکێزێه یکێتشهگ دنهبزیڕ کێسوونشهڕ هکێن

Notice that the punctuation is at opposite ends of the line in the OCR text and the ground truth.
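
To find training lines of this kind, one option might be to look for lines whose first or last characters carry no strong direction (digits, punctuation, spaces). The sketch below uses only Python's standard `unicodedata` module; the function name is hypothetical, and whether such lines are the actual trigger is an assumption, not something confirmed in this thread.

```python
import unicodedata

# Bidi classes with a strong direction: Latin-like (L), Hebrew-like (R), Arabic letters (AL).
STRONG = {"L", "R", "AL"}


def edge_runs_without_strong_direction(line):
    """Return (leading, trailing) runs of characters that have no strong direction."""
    classes = [unicodedata.bidirectional(ch) for ch in line]
    strong_positions = [i for i, c in enumerate(classes) if c in STRONG]
    if not strong_positions:
        # The whole line is numbers, punctuation, or whitespace.
        return line, ""
    first, last = strong_positions[0], strong_positions[-1]
    return line[:first], line[last + 1:]


if __name__ == "__main__":
    leading, trailing = edge_runs_without_strong_direction("\u0662\u0661 \u062a\u0634")
    print(repr(leading), repr(trailing))
```

A non-empty leading or trailing run marks a line worth rendering on its own to see whether the box file is produced.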

Shreeshrii avatar Oct 13 '19 08:10 Shreeshrii

With the first example at least, it seems that the training text contains an invisible directional character that causes the issue.
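
One way to check for such characters is to scan the training text for explicit bidi formatting characters and directional marks. The following is a minimal sketch using Python's standard `unicodedata` module; the function names are hypothetical, and whether stripping these characters is appropriate for training data is an assumption, since removing them can change how the line is ordered when rendered.

```python
import unicodedata

# Bidi classes of the explicit formatting characters
# (embeddings, overrides, isolates, and their terminators).
BIDI_FORMAT_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}
# LRM, RLM and ALM are invisible too, but have strong classes (L, R, AL),
# so they are listed explicitly.
BIDI_MARKS = {"\u200e", "\u200f", "\u061c"}


def is_bidi_control(ch):
    return ch in BIDI_MARKS or unicodedata.bidirectional(ch) in BIDI_FORMAT_CLASSES


def find_bidi_controls(line):
    """Return (index, code point, name) for each invisible bidi control in line."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(line)
        if is_bidi_control(ch)
    ]


def strip_bidi_controls(line):
    """Return line with all bidi controls removed."""
    return "".join(ch for ch in line if not is_bidi_control(ch))


if __name__ == "__main__":
    # Illustrative input: an embedding pair wrapped around Arabic-Indic digits.
    sample = "\u202a \u0662\u0661\u202c x"
    for hit in find_bidi_controls(sample):
        print(hit)
    print(repr(strip_bidi_controls(sample)))
```

Running `find_bidi_controls` over each line of the `.training_text` file would list the offending positions without modifying the file.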

amitdo avatar Oct 13 '19 10:10 amitdo

https://en.wikipedia.org/wiki/Bidirectional_Text#Table_of_possible_BiDi_character_types

amitdo avatar Oct 13 '19 10:10 amitdo

@amitdo Please look at the code at https://github.com/tesseract-ocr/tesseract/blob/master/src/training/boxchar.cpp#L233

I think it might be causing some of the errors

Or, maybe this needs a rewrite using ICU's BiDi algorithms ...

https://github.com/tesseract-ocr/tesseract/blob/cb0c024a6f92fd04c951add6cf7ff497625cfda0/src/ccmain/resultiterator.cpp#L64

http://userguide.icu-project.org/transforms/bidi

When text is displayed or printed, it must be "reordered" into visual order with some parts of the text laid out left-to-right, and other parts laid out right-to-left. The Unicode standard specifies an algorithm for this logical-to-visual reordering.

https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ubidi_8h.html#a88693e5a8ad4be974dc90ec6b8db56df
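
To illustrate what logical-to-visual reordering means, here is a deliberately simplified sketch. It is not the Unicode Bidirectional Algorithm and not ICU's API: it assumes a right-to-left paragraph that contains only RTL letters, spaces, and plain digit runs (no Latin text, no separators inside numbers, no explicit directional characters, no mirroring or shaping). Under those assumptions, the visual order is the line reversed, with each digit run kept in its original left-to-right order. The function names are hypothetical.

```python
from itertools import groupby
import unicodedata


def is_number(ch):
    # European (EN) and Arabic-Indic (AN) digits are laid out left to right.
    return unicodedata.bidirectional(ch) in {"EN", "AN"}


def visual_order_simple_rtl(logical):
    """Reorder a logical-order RTL line into visual order (simplified case only)."""
    reversed_line = logical[::-1]
    out = []
    for numeric, run in groupby(reversed_line, key=is_number):
        chars = list(run)
        # Reversing the whole line also reversed the digits; undo that per run.
        out.append("".join(reversed(chars)) if numeric else "".join(chars))
    return "".join(out)


if __name__ == "__main__":
    logical = "\u05d0\u05d1 12"  # two Hebrew letters, a space, then "12"
    print([f"U+{ord(c):04X}" for c in visual_order_simple_rtl(logical)])
```

A full implementation has to resolve embedding levels for every character, which is what ICU's `ubidi` functions do; this sketch only shows why digits at the start or end of an RTL line end up on the opposite side from where they sit in the logical string.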

Shreeshrii avatar Oct 13 '19 11:10 Shreeshrii

Is there any solution for this? I think the issue is related to a memory bug in text2image. I say that because the null-box errors follow a pattern: a couple of lines are processed correctly, then two lines give the null box error, then a couple of lines are processed normally, then two more give the null box error, and so on.

I am having the issue with a left-to-right language (not right-to-left). It occurs on both Ubuntu and Mac, and with both Tesseract 4 and 5. So it is a deeper problem, not specific to a language type or an operating system.

DesBw avatar Sep 17 '23 08:09 DesBw