tesserocr
Tesserocr does not read UZN files
Hello,
when I try to specify the regions of interest via a .UZN file (zones file), tesserocr ignores the file, even though I set it up according to this tutorial.
The code I use:
import os

from tesserocr import PyTessBaseAPI

image_save_path = 'some/path/to/jpg/file.jpg'
# uzn path is 'some/path/to/jpg/file.uzn'
_tesseract_api = PyTessBaseAPI(
    lang='ces',
    psm=4,
    oem=1,
    path=os.getenv('TESSDATA_PREFIX')
)
_tesseract_api.ReadConfigFile("tsv")
_tesseract_api.ReadConfigFile("logfile")
_tesseract_api.SetImageFile(image_save_path)
_tesseract_api.Recognize()
text = _tesseract_api.GetUTF8Text()
The code returns the contents of the whole page, not just the regions specified in the UZN file.
Is it a bug or am I doing something wrong? Thanks!
First of all: why do you want to use a UZN file when you can use API/SetRectangle? UZN files are for users of the tesseract executable... Next: see https://github.com/tesseract-ocr/tesseract/issues/3837
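For reference, the SetRectangle route could look roughly like the minimal sketch below. It assumes each line of the zones file has the layout `left top width height label`. `parse_uzn` and `ocr_zones` are hypothetical helper names, not part of tesserocr. The tesserocr import is deferred so the parser can be used on its own.

```python
from pathlib import Path


def parse_uzn(uzn_path):
    """Parse a zones file into (left, top, width, height, label) tuples.

    Assumption: one zone per line, four integers followed by an optional label.
    """
    zones = []
    for line in Path(uzn_path).read_text().splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank or incomplete lines
        left, top, width, height = (int(p) for p in parts[:4])
        label = parts[4] if len(parts) > 4 else ""
        zones.append((left, top, width, height, label))
    return zones


def ocr_zones(image_path, zones, lang="ces"):
    """Recognize each zone separately by restricting OCR with SetRectangle."""
    # Imported lazily; requires tesserocr and the language data to be installed.
    from tesserocr import PyTessBaseAPI, PSM

    results = []
    with PyTessBaseAPI(lang=lang, psm=PSM.SINGLE_BLOCK) as api:
        api.SetImageFile(image_path)
        for left, top, width, height, label in zones:
            # Limit recognition to this zone, then read its text.
            api.SetRectangle(left, top, width, height)
            results.append((label, api.GetUTF8Text()))
    return results
```

With this approach each zone is recognized independently, so the zones file is read by the script rather than by Tesseract itself.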
I want to use a UZN file to bypass Tesseract's internal segmentation, which I cannot control and which fails on my documents: it does not find all text regions when the text is sparsely distributed on a page.
In the end, I got the UZN file working through API/ProcessPage: I passed the path to the image as the filename parameter, with the UZN file located next to the image. With that, it worked.
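A minimal sketch of that ProcessPage approach follows. It assumes the zones file shares the image's base name and directory, as in the paths above. `uzn_path_for` and `process_with_uzn` are hypothetical helper names, and `outputbase` is an illustrative output prefix. The exact output files depend on the configured renderers, so treat this as an outline rather than a verified recipe.

```python
from pathlib import Path


def uzn_path_for(image_path):
    """Return the assumed zones-file path: the image path with a .uzn extension."""
    return Path(image_path).with_suffix(".uzn")


def process_with_uzn(image_path, outputbase, lang="ces"):
    """Run ProcessPage with the image path passed as the filename argument."""
    uzn_path = uzn_path_for(image_path)
    if not uzn_path.exists():
        raise FileNotFoundError(f"zones file not found: {uzn_path}")

    # Imported lazily; requires tesserocr and Pillow to be installed.
    from PIL import Image
    from tesserocr import PyTessBaseAPI, PSM

    # psm=4 (single column) mirrors the setting used in the question.
    with PyTessBaseAPI(lang=lang, psm=PSM.SINGLE_COLUMN) as api:
        with Image.open(image_path) as image:
            # Passing the image path as `filename` is what made the
            # neighbouring .uzn file take effect in this thread.
            return api.ProcessPage(outputbase, image, 0, str(image_path))
```

A call would look like `process_with_uzn('some/path/to/jpg/file.jpg', 'some/path/to/output')`, where the second argument is a made-up output prefix.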