
Tesserocr does not read UZN files

Open DevKretov opened this issue 2 years ago • 2 comments

Hello,

when I try to specify the regions of interest via a .UZN file (zones file), tesserocr ignores the file, even though it is set up according to this tutorial.

The code I use:

import os

from tesserocr import PyTessBaseAPI

image_save_path = 'some/path/to/jpg/file.jpg'
# uzn path is 'some/path/to/jpg/file.uzn' 

_tesseract_api = PyTessBaseAPI(
    lang='ces',
    psm=4,
    oem=1,
    path=os.getenv('TESSDATA_PREFIX')
)
_tesseract_api.ReadConfigFile("tsv")
_tesseract_api.ReadConfigFile("logfile")
_tesseract_api.SetImageFile(image_save_path)
_tesseract_api.Recognize()

_tesseract_api.GetUTF8Text()

The code returns the text of the whole page, not just the zones specified in the UZN file.

Is it a bug or am I doing something wrong? Thanks!

DevKretov avatar Jul 08 '22 11:07 DevKretov

First of all: why do you want to use a UZN file when you can use API/SetRectangle? UZN files are meant for users of the tesseract executable. See also: https://github.com/tesseract-ocr/tesseract/issues/3837
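A minimal sketch of the SetRectangle approach, assuming each UZN line has the form `left top width height label` and that the zones are OCRed one at a time. The helper names `read_uzn` and `ocr_zones` are hypothetical, and `PSM.SINGLE_BLOCK` is only one plausible choice of page segmentation mode for a pre-cropped zone.

```python
from pathlib import Path


def read_uzn(uzn_path):
    """Parse a UZN file into (left, top, width, height, label) tuples."""
    zones = []
    for line in Path(uzn_path).read_text().splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        left, top, width, height = (int(p) for p in parts[:4])
        label = parts[4] if len(parts) > 4 else ""
        zones.append((left, top, width, height, label))
    return zones


def ocr_zones(image_path, zones, lang="eng"):
    """Run OCR on each zone separately; returns a list of (label, text)."""
    # Imported here so that read_uzn works without tesserocr installed.
    from tesserocr import PSM, PyTessBaseAPI

    results = []
    with PyTessBaseAPI(lang=lang, psm=PSM.SINGLE_BLOCK) as api:
        api.SetImageFile(image_path)
        for left, top, width, height, label in zones:
            # Restrict recognition to this rectangle of the loaded image.
            api.SetRectangle(left, top, width, height)
            results.append((label, api.GetUTF8Text()))
    return results
```

Usage would look like `ocr_zones('some/path/to/jpg/file.jpg', read_uzn('some/path/to/jpg/file.uzn'), lang='ces')`. Each zone is recognized independently, so no layout analysis runs across zones.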

zdenop avatar Jul 09 '22 08:07 zdenop

I want to use a UZN file to bypass Tesseract's internal segmentation, which I cannot control and which fails on my documents: it does not find all text regions when the text is sparsely distributed on a page.

In the end, I got the UZN file to work through API/ProcessPage: I passed the path to the image as the filename parameter, with the UZN file located next to the image.
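A rough sketch of that ProcessPage workaround, for anyone landing here. Several things in it are assumptions and not confirmed in this thread: the argument order `ProcessPage(outputbase, image, page_index, filename)` is my reading of tesserocr and should be checked against the installed version, the `tessedit_create_txt` variable is set only so that a renderer is available, and the helper names `write_uzn` and `ocr_with_uzn` as well as the `output_base` parameter are hypothetical.

```python
from pathlib import Path


def write_uzn(image_path, zones):
    """Write zones next to the image as <image base name>.uzn; returns the path."""
    uzn_path = Path(image_path).with_suffix(".uzn")
    lines = [f"{l} {t} {w} {h} {label}" for l, t, w, h, label in zones]
    uzn_path.write_text("\n".join(lines) + "\n")
    return uzn_path


def ocr_with_uzn(image_path, output_base, lang="eng"):
    """OCR an image via ProcessPage so that the sibling .uzn file is picked up."""
    # Imported here so that write_uzn works without tesserocr or Pillow installed.
    from PIL import Image
    from tesserocr import PSM, PyTessBaseAPI

    # psm=4 (single column) as in the original post.
    with PyTessBaseAPI(lang=lang, psm=PSM.SINGLE_COLUMN) as api:
        # Assumption: request plain-text output so ProcessPage has a renderer.
        api.SetVariable("tessedit_create_txt", "true")
        with Image.open(image_path) as image:
            # The filename argument is what lets Tesseract locate the .uzn file.
            ok = api.ProcessPage(output_base, image, 0, image_path)
        if not ok:
            raise RuntimeError(f"ProcessPage failed for {image_path}")
        return api.GetUTF8Text()
```

Usage would be along the lines of `write_uzn(path, [(10, 20, 300, 40, "Text")])` followed by `ocr_with_uzn(path, "/tmp/out", lang="ces")`. As a side effect, a text file based on `output_base` may also be written.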

DevKretov avatar Jul 09 '22 10:07 DevKretov