tesseract
text2image - RTL - Null box at index 0
tesseract -v
tesseract 5.0.0-alpha-410-g6a95
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.3.0
Running text2image on single-line texts helped isolate some lines that cause errors during processing. Hopefully this will help identify the bugs.
The text file and the image are included in a zip file. The box file is not created.
ubuntu@tesseract-ocr:~/Sorani$ f=rtlorder.errors
ubuntu@tesseract-ocr:~/Sorani$ fontname="Unikurd Web"
ubuntu@tesseract-ocr:~/Sorani$ OMP_THREAD_LIMIT=1 text2image --strip_unrenderable_words --xsize=3000 --ysize=152 --leading=32 --margin=12 --char_spacing=0.0 --exposure=0 --max_pages=0 --fonts_dir=/home/ubuntu/.fonts --font="$fontname" --text="$f".training_text --outputbase="$f"
Rendered page 0 to file rtlorder.errors.tif
Null box at index 0
Error: Call PrepareToWrite before WriteTesseractBoxFile!!
In this particular case, the line seems to have two phrases that are getting swapped.
I just copied and pasted the sentence and it looks okay.
٢١تشرینی یەکەمی ١٩٩٣لە ماڵی محەمەد حەمۆدا کۆ
Please try running text2image with that single line as the text and see whether a box file is created. I get an error.
I had put the text in a div with direction RTL.
Which font did you use? It seems to be written in Kurdish.
OK, it's "Unikurd Web".
I tried with the Arial font as well; same error.
ubuntu@tesseract-ocr:~/Sorani$ tesseract -v
tesseract 5.0.0-alpha-473-g6d171
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.3.0
ubuntu@tesseract-ocr:~/Sorani$ text2image -v
Using CAIRO_FONT_TYPE_FT.
5.0.0-alpha-473-g6d171
ubuntu@tesseract-ocr:~/Sorani$
ubuntu@tesseract-ocr:~/Sorani$ f=rtlorder.errors
ubuntu@tesseract-ocr:~/Sorani$ fontname="Unikurd Web"
ubuntu@tesseract-ocr:~/Sorani$ OMP_THREAD_LIMIT=1 text2image --strip_unrenderable_words --xsize=3000 --ysize=152 --leading=32 --margin=12 --char_spacing=0.0 --exposure=0 --max_pages=0 --fonts_dir=/home/ubuntu/.fonts --font="$fontname" --text="$f".training_text --outputbase="$f"
Rendered page 0 to file rtlorder.errors.tif
Null box at index 0
Error: Call PrepareToWrite before WriteTesseractBoxFile!!
ubuntu@tesseract-ocr:~/Sorani$
ubuntu@tesseract-ocr:~/Sorani$ f=rtlorder.errors
ubuntu@tesseract-ocr:~/Sorani$ fontname="Arial"
ubuntu@tesseract-ocr:~/Sorani$ OMP_THREAD_LIMIT=1 text2image --strip_unrenderable_words --xsize=3000 --ysize=152 --leading=32 --margin=12 --char_spacing=0.0 --exposure=0 --max_pages=0 --fonts_dir=/home/ubuntu/.fonts --font="$fontname" --text="$f".training_text --outputbase="$f"."$fontname"
Rendered page 0 to file rtlorder.errors.Arial.tif
Null box at index 0
Error: Call PrepareToWrite before WriteTesseractBoxFile!!
One interesting thing about the generated image is that the text on it is left-aligned, whereas images for RTL languages usually have the text on the right side of the page.
There seems to be a problem when numbers or punctuation are at the beginning or end of a line.
Another example:
Iteration 949: GROUND TRUTH : ،!!!...ێبهن یهوینهل ،شهیێپ .یهکب هویزهباد یکێزێه یكێتشهگ دنهبزیڕ کێسوونشهڕ هکێن
Iteration 949: BEST OCR TEXT : ل!!!...ێبهن یهوینهل شهیێپ .یهکب هویزهباد یکێزێه یکێتشهگ دنهبزیڕ کێسوونشهڕ هکێن
Notice that the punctuation is at opposite ends of the line in the OCR text and the ground truth.
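To find other candidate lines in a training text, something like the following could flag lines whose first or last character is not a strong directional character (digits, punctuation, and so on). This is only a sketch using Python's standard `unicodedata` module; the function names `edge_bidi_classes` and `has_non_strong_edge` are made up for illustration and are not part of Tesseract.

```python
import unicodedata

# Strong bidi classes in the Unicode Bidirectional Algorithm:
# L (left-to-right), R (right-to-left), AL (Arabic letter).
STRONG = {"L", "R", "AL"}


def edge_bidi_classes(line):
    """Return the bidi classes of the first and last non-whitespace characters."""
    stripped = line.strip()
    if not stripped:
        return (None, None)
    return (
        unicodedata.bidirectional(stripped[0]),
        unicodedata.bidirectional(stripped[-1]),
    )


def has_non_strong_edge(line):
    """True if the line starts or ends with a weak or neutral character."""
    first, last = edge_bidi_classes(line)
    if first is None:
        return False
    return first not in STRONG or last not in STRONG
```

Lines for which `has_non_strong_edge` returns `True` would be the ones worth rendering one at a time with text2image. Note that an invisible mark such as U+200F counts as strong here, so a line that starts with one will not be flagged by this check.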
With the first example at least, it seems that the training text contains an invisible directional character that causes the issue.
https://en.wikipedia.org/wiki/Bidirectional_Text#Table_of_possible_BiDi_character_types
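A quick way to check a training text for such invisible characters is to scan it for the explicit bidi formatting characters. The sketch below uses only Python's standard library; the set `BIDI_CONTROLS` lists the marks, embeddings, overrides, and isolates defined by Unicode, and the helper name `find_bidi_controls` is hypothetical.

```python
import unicodedata

# Explicit bidi formatting characters (marks, embeddings, overrides, isolates).
BIDI_CONTROLS = {
    "\u061C",  # ARABIC LETTER MARK
    "\u200E", "\u200F",  # LRM, RLM
    "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",  # LRI, RLI, FSI, PDI
}


def find_bidi_controls(text):
    """Return (line_no, column, codepoint, name) for each bidi control found."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append(
                    (line_no, col, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
                )
    return hits
```

Reading the `.training_text` file as UTF-8 and passing its contents to `find_bidi_controls` would show where any such character sits. Whether removing it makes the "Null box" error go away is something that would still need to be tested.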
@amitdo Please look at the code at https://github.com/tesseract-ocr/tesseract/blob/master/src/training/boxchar.cpp#L233
I think it might be causing some of the errors.
Or, maybe this needs a rewrite using ICU's BiDi algorithms ...
https://github.com/tesseract-ocr/tesseract/blob/cb0c024a6f92fd04c951add6cf7ff497625cfda0/src/ccmain/resultiterator.cpp#L64
http://userguide.icu-project.org/transforms/bidi
When text is displayed or printed, it must be "reordered" into visual order with some parts of the text laid out left-to-right, and other parts laid out right-to-left. The Unicode standard specifies an algorithm for this logical-to-visual reordering.
https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ubidi_8h.html#a88693e5a8ad4be974dc90ec6b8db56df
Is there any solution for this? I think the issue is related to a memory bug in text2image. I say that because I see a pattern in which lines produce the null-box error: it processes a couple of lines correctly, then gives the null box error for two lines, then processes a couple of lines normally, then gives the null box error for two lines, and so on.
I am having this issue with a left-to-right language (not right-to-left), on both Ubuntu and Mac, and with both Tesseract 4 and 5. So it is a deeper problem, not specific to a language type or an operating system.
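One way to tell whether the failures depend on line position or on line content would be to render every line separately and record which ones fail. The following is a rough sketch, not a tested tool: it reuses the text2image flags from the commands earlier in this thread, assumes the box file is written as `<outputbase>.box`, and the helper names (`split_into_line_files`, `build_command`, `run_line`) are invented for illustration.

```python
import os
import subprocess
from pathlib import Path


def split_into_line_files(training_text, work_dir):
    """Write each non-empty line to its own file; return (line_no, path) pairs."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    lines = Path(training_text).read_text(encoding="utf-8").split("\n")
    for index, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        path = work_dir / f"line{index:05d}.training_text"
        path.write_text(line + "\n", encoding="utf-8")
        parts.append((index, path))
    return parts


def build_command(text_path, font, fonts_dir):
    """Build a text2image command using the flags shown earlier in this thread."""
    text_path = Path(text_path)
    outputbase = text_path.with_suffix("")
    return [
        "text2image",
        "--strip_unrenderable_words",
        "--xsize=3000", "--ysize=152", "--leading=32", "--margin=12",
        f"--fonts_dir={fonts_dir}",
        f"--font={font}",
        f"--text={text_path}",
        f"--outputbase={outputbase}",
    ]


def run_line(text_path, font, fonts_dir):
    """Run text2image on one line file; return True if it looks like a failure."""
    cmd = build_command(text_path, font, fonts_dir)
    env = dict(os.environ, OMP_THREAD_LIMIT="1")
    proc = subprocess.run(cmd, capture_output=True, text=True, env=env)
    log = proc.stdout + proc.stderr
    outputbase = Path(text_path).with_suffix("")
    # Assumption: the box file is named <outputbase>.box.
    box = outputbase.with_name(outputbase.name + ".box")
    return "Null box" in log or not box.exists()
```

Calling `run_line` for each pair returned by `split_into_line_files` gives the list of failing line numbers. If the same lines fail when rendered alone, the cause is more likely the line content than memory state carried over from earlier lines.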