tesserocr Different result using tesserocr and command line version

Open TheSithPadawan opened this issue 6 years ago • 4 comments

Hi,

I'm getting different results for the same document using tesserocr and the command line version of Tesseract. It seems like tesserocr is not getting some of the words. What could be wrong? Thanks!

I'm using this function to get all the words from an image:

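from PIL import Image
from tesserocr import PyTessBaseAPI, RIL

# pdfpage, word and punct come from the poster's own project and are not shown here.

def get_word_data(img):
    pdf = pdfpage.PDFPage('folder location', 1)
    # Context managers close the image and the API even if OCR raises.
    with Image.open(img, mode='r') as image, PyTessBaseAPI() as api:
        api.SetImage(image)
        # Layout analysis: one (image, box, block_id, paragraph_id) tuple per word.
        boxes = api.GetComponentImages(RIL.WORD, True)
        for i, (im, box, _, _) in enumerate(boxes):
            # Restrict recognition to this word's bounding box and OCR it on its own.
            api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
            ocrResult = api.GetUTF8Text().strip()
            if len(ocrResult) != 0:
                conf = api.MeanTextConf()
                # Drop a single trailing punctuation character.
                if ocrResult[-1] in punct:
                    ocrResult = ocrResult[:-1]
                doc_word = word.Word(
                    i,
                    box['x'],
                    box['y'],
                    box['w'],
                    box['h'],
                    conf,
                    ocrResult)
                pdf.add_word(doc_word)
    return pdf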

Mar 20 '18 01:03 TheSithPadawan

Try using api.SetImageFile instead of api.SetImage to make sure it's not PIL.Image altering the image somehow.
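A minimal sketch of that check, assuming the image is a regular file on disk. The helper names (text_via_file, text_via_pil, inputs_agree) are made up for illustration; only SetImageFile, SetImage and GetUTF8Text are tesserocr calls.

```python
def text_via_file(api, path):
    """OCR the image by letting Tesseract read the file itself."""
    api.SetImageFile(path)
    return api.GetUTF8Text()


def text_via_pil(api, pil_image):
    """OCR the same image after it has been loaded through PIL."""
    api.SetImage(pil_image)
    return api.GetUTF8Text()


def inputs_agree(path):
    """Return True if both ways of feeding the image give identical text."""
    # Local imports keep the helpers above usable without the OCR stack installed.
    from PIL import Image
    from tesserocr import PyTessBaseAPI

    with PyTessBaseAPI() as api, Image.open(path) as image:
        return text_via_file(api, path) == text_via_pil(api, image)
```

If inputs_agree returns True for the problem page, PIL is probably not what drops the words, and the difference is more likely in how the page is segmented or iterated.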

Mar 20 '18 15:03 sirfz

What's the command you're running via the CLI? It could be calling a different API, in which case the two results aren't directly comparable.

Try running something similar to the "Iterator over the classifier choices for a single symbol" example.
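A rough word-level adaptation of that iterator approach, as a sketch rather than the exact example: the page is recognized once and then walked word by word, instead of re-running OCR on each cropped box. collect_words and words_in_image are hypothetical helper names; the RuntimeError guard is a precaution for positions that return no text.

```python
def collect_words(words, level):
    """Turn word-level iterator positions into (text, confidence, box) tuples."""
    results = []
    for w in words:
        try:
            text = w.GetUTF8Text(level)
        except RuntimeError:
            # Skip positions where the iterator has no text to return.
            continue
        if not text or not text.strip():
            continue
        # Read everything inside the loop: the iterator reuses one object.
        results.append((text.strip(), w.Confidence(level), w.BoundingBox(level)))
    return results


def words_in_image(path):
    """Recognize the whole page once and list every word found."""
    # Local import keeps collect_words usable without tesserocr installed.
    from tesserocr import PyTessBaseAPI, RIL, iterate_level

    with PyTessBaseAPI() as api:
        api.SetImageFile(path)
        api.Recognize()
        return collect_words(iterate_level(api.GetIterator(), RIL.WORD), RIL.WORD)
```

Each tuple holds the word text, its confidence and its bounding box as (left, top, right, bottom), which should be enough to fill a per-word record like the one in the original function.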

Mar 20 '18 17:03 sirfz

I'm using tesseract filename outputname on the command line, basically with all the default settings. My issue is that my code misses some words, such as PAGE, RUN TIME, and RUN USER. My goal is not to get the recognition result for a single symbol or box, but to get all the words in a document. Thanks!

Mar 20 '18 17:03 TheSithPadawan

The default psm when using the command line is 3 (PSM_AUTO, automatic layout analysis).

The default psm when using the C++ API is 6 (PSM_SINGLE_BLOCK, single block).
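To take the defaults out of the picture, the page segmentation mode can be set explicitly before recognition. The sketch below runs the same file under both modes so the outputs can be compared side by side; text_with_psm and compare_psm are hypothetical helper names, while SetPageSegMode, PSM.AUTO and PSM.SINGLE_BLOCK come from tesserocr.

```python
def text_with_psm(api, path, psm):
    """OCR one file under an explicitly chosen page segmentation mode."""
    api.SetPageSegMode(psm)  # set the mode before handing over the image
    api.SetImageFile(path)
    return api.GetUTF8Text()


def compare_psm(path):
    """Return the OCR text for the automatic and single-block modes."""
    # Local import keeps text_with_psm usable without tesserocr installed.
    from tesserocr import PyTessBaseAPI, PSM

    with PyTessBaseAPI() as api:
        return {
            "AUTO": text_with_psm(api, path, PSM.AUTO),
            "SINGLE_BLOCK": text_with_psm(api, path, PSM.SINGLE_BLOCK),
        }
```

If the missing words show up under one mode but not the other, the segmentation mode is the likely cause. On the command line the mode can be pinned the same way with the --psm option (-psm in older releases).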

Jul 16 '18 18:07 amitdo
