tesserocr
Different result using tesserocr and command line version
Hi,
I'm getting different results for the same document using tesserocr and the command line version of Tesseract. It seems like tesserocr is not getting some of the words. What could be wrong? Thanks!
I'm using this function to get all the words from an image:
from PIL import Image
from tesserocr import PyTessBaseAPI, RIL

# pdfpage, word, and punct are defined elsewhere in the project (not shown).


def get_word_data(img):
    image = Image.open(img, mode='r')
    pdf = pdfpage.PDFPage('folder location', 1)
    level = RIL.WORD
    with PyTessBaseAPI() as api:
        api.SetImage(image)
        # Find the word boxes first, then recognize each box separately.
        boxes = api.GetComponentImages(level, True)
        for i, (im, box, _, _) in enumerate(boxes):
            api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
            ocrResult = api.GetUTF8Text().strip()
            if len(ocrResult) != 0:
                conf = api.MeanTextConf()
                # Drop a single trailing punctuation character.
                if ocrResult[-1] in punct:
                    ocrResult = ocrResult[:-1]
                doc_word = word.Word(
                    i,
                    box['x'],
                    box['y'],
                    box['w'],
                    box['h'],
                    conf,
                    ocrResult)
                pdf.add_word(doc_word)
    image.close()
    return pdf
Try using api.SetImageFile instead of api.SetImage to make sure it's not PIL.Image altering the image somehow.
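A minimal sketch of that check: run the same image through both loading paths and list the words that only one of them finds. The helper names (ocr_via_file, ocr_via_pil, missing_words) are hypothetical and not part of tesserocr; only SetImageFile, SetImage and GetUTF8Text are tesserocr calls.

```python
def ocr_via_file(path):
    """OCR an image that Tesseract loads from disk itself."""
    from tesserocr import PyTessBaseAPI  # imported lazily; requires tesserocr

    with PyTessBaseAPI() as api:
        api.SetImageFile(path)
        return api.GetUTF8Text()


def ocr_via_pil(path):
    """OCR the same image after loading it through PIL."""
    from PIL import Image
    from tesserocr import PyTessBaseAPI

    with Image.open(path) as image, PyTessBaseAPI() as api:
        api.SetImage(image)
        return api.GetUTF8Text()


def missing_words(reference_text, candidate_text):
    """Return the words of reference_text that never occur in candidate_text."""
    seen = set(candidate_text.split())
    return [w for w in reference_text.split() if w not in seen]
```

If missing_words(ocr_via_file(path), ocr_via_pil(path)) comes back empty for the document in question, PIL is probably not the cause of the difference.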
What's the command you're running via the cli? It could be calling a different API so you can't compare with that.
Try running an example similar to "Iterator over the classifier choices for a single symbol".
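Adapted to whole words instead of symbols, such an example might look like the sketch below. It assumes the image is read from a file path; box_from_bounds and iter_words are hypothetical helper names, while Recognize, GetIterator, iterate_level, BoundingBox, Confidence and GetUTF8Text are the tesserocr calls involved.

```python
def box_from_bounds(left, top, right, bottom):
    """Convert iterator bounds to the x/y/w/h layout used by GetComponentImages."""
    return {'x': left, 'y': top, 'w': right - left, 'h': bottom - top}


def iter_words(path):
    """Yield (text, confidence, box) for each word found on the page."""
    from tesserocr import PyTessBaseAPI, RIL, iterate_level  # requires tesserocr

    with PyTessBaseAPI() as api:
        api.SetImageFile(path)
        api.Recognize()  # one recognition pass over the whole page
        level = RIL.WORD
        for r in iterate_level(api.GetIterator(), level):
            bounds = r.BoundingBox(level)  # (left, top, right, bottom) or None
            if bounds is None:
                continue  # nothing at this position, skip it
            text = r.GetUTF8Text(level)
            conf = r.Confidence(level)
            yield text, conf, box_from_bounds(*bounds)
```

Unlike calling SetRectangle once per box, this recognizes the page in a single pass and then walks the results, which is closer to what the command line tool does.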
I'm using tesseract filename outputname on the command line, so basically all the default settings. My issue is that my code does not return some words, such as PAGE, RUN TIME, RUN USER, etc. My goal is not to get the recognition result for a single symbol or a box, but to get all the words in a document. Thanks!
The default psm when using the command line is 3 (PSM_AUTO, fully automatic page segmentation).
The default psm when using the C++ API is 6 (PSM_SINGLE_BLOCK, a single uniform block of text).
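To compare like with like, the page segmentation mode can be set explicitly on both sides. The sketch below assumes the psm keyword of PyTessBaseAPI and tesserocr's PSM enum; cli_command and ocr_with_psm are hypothetical helper names, and the --psm spelling applies to Tesseract 4 and later (3.x releases used -psm).

```python
def cli_command(image_path, output_base, psm):
    """Build a tesseract command line with an explicit page segmentation mode."""
    return ["tesseract", image_path, output_base, "--psm", str(psm)]


def ocr_with_psm(image_path, auto=True):
    """OCR a file with automatic layout analysis or as a single text block."""
    from tesserocr import PyTessBaseAPI, PSM  # requires tesserocr

    mode = PSM.AUTO if auto else PSM.SINGLE_BLOCK
    with PyTessBaseAPI(psm=mode) as api:
        api.SetImageFile(image_path)
        return api.GetUTF8Text()
```

Running both the command line and the Python code with the same mode rules out page segmentation as the source of the missing words.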