tesserocr
GetUTF8Text produces different results on RIL.WORD
Environment
Tesseract Version: 3.05.02
Commit Number:
Platform: Mac - Darwin 186590d0071d.ant.amazon.com 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 21 20:07:39 PDT 2018; root:xnu-3789.73.14~1/RELEASE_X86_64 x86_64
tesseract 3.05.02
 leptonica-1.76.0
  libjpeg 9c : libpng 1.6.35 : libtiff 4.0.9 : zlib 1.2.8
Current Behavior: I notice that GetUTF8Text on word boundaries returns different (and worse) text results compared to line boundaries.
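import tesserocr as tr

# `api` is an initialized tr.PyTessBaseAPI with the image already set
boxes = api.GetComponentImages(tr.RIL.WORD, True)
# get full text
full_text = api.GetUTF8Text()
print("FULL TEXT", full_text)
for i, (im, box, block_id, paragraph_id) in enumerate(boxes):
    # restrict recognition to this word's box (assumed; not shown in the posted snippet)
    api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
    text = api.GetUTF8Text()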
The full text and the per-word text differ. The full-text results are better than the word-level results.
Here is the image.

Full-text detection: FULL TEXT precision—specifically, to be able to show median debt burden lo one decimal place rather than an inlege}:
Here is the full line text detection with AUTO-OSD PSM

Here is the Word Text detection with AUTO-OSD PSM
Here is the Word Text detection with Single Line PSM

Expected Behavior: I expect the same text results from line-level and word-level detection.
Is there any reason why the text results differ between line and word detection?
Yes, there is. For one, at the line level Tesseract has a longer context in which to decide on individual characters and words. It uses a language model for this (i.e. a stochastic model of word sequences). Also, it is free to position words and characters within a line depending on the recognition results, whereas the isolated word boundaries (boxes) are derived from an independent layout analysis (and can therefore be suboptimal).
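
If word-level output is needed without giving up the line context described above, one option is to recognize the whole page once and then read the words from the result iterator, instead of re-running recognition on each word box. The following is only a rough sketch under assumptions: the helper names (words_from_page, words_from_boxes, diff_words) and the image path passed on the command line are hypothetical and not part of tesserocr, and the default "eng" language data is assumed to be installed. It has not been run against the image from this issue.

```python
import difflib
import sys


def words_from_page(image_path, lang="eng"):
    """Word texts taken from one full-page recognition pass (line context is kept)."""
    from tesserocr import PyTessBaseAPI, RIL, iterate_level  # imported lazily
    with PyTessBaseAPI(lang=lang) as api:
        api.SetImageFile(image_path)
        api.Recognize()
        iterator = api.GetIterator()
        if iterator is None:
            return []
        words = []
        for item in iterate_level(iterator, RIL.WORD):
            try:
                words.append(item.GetUTF8Text(RIL.WORD))
            except RuntimeError:
                # tesserocr raises RuntimeError when no text is available here
                continue
        return words


def words_from_boxes(image_path, lang="eng"):
    """Word texts obtained by recognizing each word box in isolation."""
    from tesserocr import PyTessBaseAPI, RIL  # imported lazily
    with PyTessBaseAPI(lang=lang) as api:
        api.SetImageFile(image_path)
        boxes = api.GetComponentImages(RIL.WORD, True)
        words = []
        for _im, box, _block_id, _paragraph_id in boxes:
            # restrict recognition to the current word box
            api.SetRectangle(box["x"], box["y"], box["w"], box["h"])
            words.append(api.GetUTF8Text().strip())
        return words


def diff_words(page_words, box_words):
    """Return (page_chunk, box_chunk) pairs where the two word lists disagree."""
    matcher = difflib.SequenceMatcher(a=page_words, b=box_words, autojunk=False)
    return [
        (page_words[i1:i2], box_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    path = sys.argv[1]  # hypothetical: path to the image under test
    for page_chunk, box_chunk in diff_words(words_from_page(path),
                                            words_from_boxes(path)):
        print("page:", page_chunk, "| boxes:", box_chunk)
```

Run with an image path as the only argument, this prints the places where the two approaches disagree. The page segmentation mode is left at the tesserocr default here; the modes mentioned in the issue can be passed through the psm argument of PyTessBaseAPI.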