tesserocr GetUTF8Text produces differently on RIL.WORD

Open kkurni opened this issue 7 years ago • 1 comments

Environment:
Tesseract Version: 3.05.02
Commit Number:
Platform: Mac - Darwin 186590d0071d.ant.amazon.com 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 21 20:07:39 PDT 2018; root:xnu-3789.73.14~1/RELEASE_X86_64 x86_64
tesseract 3.05.02 leptonica-1.76.0 libjpeg 9c : libpng 1.6.35 : libtiff 4.0.9 : zlib 1.2.8

Current Behavior: I notice that GetUTF8Text on word boundaries returns different (and worse) text results compared to line boundaries.

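boxes = api.GetComponentImages(tr.RIL.WORD, True)
# get full text
full_text = api.GetUTF8Text()
print("FULL TEXT", full_text)

for i, (im, box, block_id, paragraph_id) in enumerate(boxes):
    # assumed step, not shown in the original snippet:
    # restrict recognition to the current word box before reading its text
    api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
    text = api.GetUTF8Text()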

The full text and the per-word text differ. The full-text results are better than the word-level results.

Here is the image: [screenshot, 2018-09-13 at 1:02:38 pm]

Full-text detection: FULL TEXT precision—speciﬁcally, to be able to show median debt burden lo one decimal place rather than an inlege}:

Here is the full-line text detection with AUTO-OSD PSM: [screenshot, 2018-09-13 at 1:03:29 pm]

Here is the word text detection with AUTO-OSD PSM: [screenshot, 2018-09-13 at 1:04:37 pm]

Here is the word text detection with Single Line PSM: [screenshot, 2018-09-13 at 1:03:38 pm]

Expected Behavior: I expect line detection and word detection to produce the same text results.

Is there any reason why the text results differ between line and word detection?

Sep 13 '18 20:09 kkurni

Yes, there is. For one, at the line level, Tesseract has a longer context for deciding on individual characters and words. It uses a language model for this (i.e. a stochastic model of word sequences). Also, it is free to position words and characters within a line depending on the recognition results, whereas the isolated word boundaries (boxes) are derived from an independent layout analysis (and can therefore be suboptimal).

Jan 25 '19 19:01 bertsky
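
To make the difference described above concrete, here is a minimal sketch that produces both word lists so they can be compared side by side. It assumes the per-word text in the question was obtained by calling SetRectangle on each box; the alternative recognizes the image once and then walks the result at word level through the result iterator, so each word keeps its line context. The helper names (words_via_rectangles, words_in_line_context, compare) are made up for illustration. The tesserocr calls used are SetImage, GetComponentImages, SetRectangle, Recognize, GetIterator, GetUTF8Text and Next.

```python
def words_via_rectangles(api, boxes):
    """Re-recognize every word box on its own, without line context."""
    texts = []
    for _image, box, _block_id, _paragraph_id in boxes:
        # Assumption: this mirrors how the per-word text in the question was read.
        api.SetRectangle(box["x"], box["y"], box["w"], box["h"])
        texts.append(api.GetUTF8Text().strip())
    return texts


def words_in_line_context(api, word_level):
    """Recognize the current region once, then read the result word by word."""
    api.Recognize()
    iterator = api.GetIterator()
    if iterator is None:
        return []
    texts = []
    while True:
        texts.append(iterator.GetUTF8Text(word_level))
        if not iterator.Next(word_level):
            break
    return texts


def compare(image_path):
    """Print both word lists for one image (needs tesserocr and Pillow)."""
    import tesserocr as tr
    from PIL import Image

    with tr.PyTessBaseAPI() as api:
        api.SetImage(Image.open(image_path))
        boxes = api.GetComponentImages(tr.RIL.WORD, True)
        # Run the whole-image pass first, before SetRectangle narrows the region.
        in_context = words_in_line_context(api, tr.RIL.WORD)
        isolated = words_via_rectangles(api, boxes)
    print("IN CONTEXT:", in_context)
    print("ISOLATED:  ", isolated)
```

Calling compare with the path to the screenshot should show whether the words read from the single whole-image pass match the full-text output more closely than the boxes recognized in isolation. This is a way to observe the effect, not a guarantee that the two lists will align one to one, since the word boxes come from a separate layout analysis.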
