tesserocr: Is this expected? PIL+image_to_text and file_to_text give different results

I ran the sample code from the README on one of my images and, interestingly, it gives two different results: loading the image with PIL and calling image_to_text produces different text than calling file_to_text directly on the file. The PIL version seems to perform better. The images are just regular JPEGs. The sample code and its output are below.

CODE

import tesserocr
from PIL import Image

print tesserocr.tesseract_version()  # print tesseract-ocr version
print tesserocr.get_languages()  # prints tessdata path and list of available languages

image = Image.open(l.blobname)
print tesserocr.image_to_text(image)  # print ocr text from image
print "==================================================================="
# or
print tesserocr.file_to_text(l.blobname)

OUTPUT


tesseract 3.04.01
 leptonica-1.73
  libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0

(u'/usr/share/tesseract-ocr/tessdata/', [u'equ', u'eng', u'osd'])
Everyday
Low
Price
Hem n 796577
um pm 5»

murAﬂWuNNEHSM

 


===================================================================
 

1m mm umvtnm

s 1 7885333252;

 
     
   
  

Everyday
Low

 

PVICE

Feb 26 '17 12:02 thavidu

That's weird indeed. I can't really help unless you provide the image so I can reproduce the issue. Here are some tips for debugging the problem:

Try running the tesseract console command against your image: tesseract <image-file> stdout
Try viewing the thresholded images after loading the image into the tesseract API:

import tesserocr
from PIL import Image

api = tesserocr.PyTessBaseAPI()
image = Image.open(l.blobname)  # l.blobname: the image path from the original post

# load PIL image
api.SetImage(image)
# view thresholded image
api.GetThresholdedImage().show()

# load image from file
api.SetImageFile(l.blobname)
# view thresholded image
api.GetThresholdedImage().show()

This should show you whether there are any visible differences between the images as processed by the API.
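If eyeballing the two thresholded images is inconclusive, the comparison can also be put into a number. The following is only a rough sketch, not part of tesserocr: differing_pixel_fraction is a hypothetical helper name, and it assumes both images have the same size and have been converted to one byte per pixel.

```python
def differing_pixel_fraction(a, b):
    """Return the fraction of positions at which two equal-length byte buffers differ.

    Hypothetical helper: `a` and `b` are assumed to hold one byte per pixel
    for two images of identical width and height.
    """
    if len(a) != len(b):
        raise ValueError("buffers differ in length: %d vs %d" % (len(a), len(b)))
    if not a:
        # Two empty buffers are trivially identical.
        return 0.0
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / float(len(a))
```

With the snippet above, the two buffers could be obtained by calling api.GetThresholdedImage().convert("L").tobytes() once after SetImage and once after SetImageFile. A result of 0.0 would mean the API sees identical thresholded images on both paths; anything larger suggests the inputs already differ before recognition starts.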

If you'd like to share your image for reproducing the issue, please share your PIL and tesserocr versions as well.

Feb 26 '17 20:02 sirfz

Might be related:

https://github.com/sirfz/tesserocr/issues/55#issuecomment-309237269

sirfz commented

Another thing to note is that Pillow seems to apply some kind of pre-processing to the image after loading it since the resulting image is not identical to the original. It might be worth bringing this up with the Pillow team.

Jun 18 '17 21:06 amitdo

It would be great if @sirfz could tell us how to use OpenCV instead of PIL.

Dec 15 '17 11:12 Link009

@Link009 OpenCV can be used as follows. Instead of calling tessapi.SetImage(img), you can call:

try:
    channels = img.shape[2]
except IndexError:
    channels = 1
tessapi.SetImageBytes(img.tobytes(), img.shape[1], img.shape[0], channels, channels*img.shape[1])
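To make the argument order easier to follow, here is a small sketch of how the array shape maps onto the SetImageBytes parameters (image data, width, height, bytes per pixel, bytes per line). The helper name set_image_bytes_args is made up for illustration, and the mapping assumes a uint8 array as returned by cv2.imread, so that one channel is exactly one byte.

```python
def set_image_bytes_args(shape):
    """Map a NumPy-style image shape to (width, height, bytes_per_pixel, bytes_per_line).

    Hypothetical helper; assumes 8-bit samples, i.e. one byte per channel.
    """
    if len(shape) == 2:
        # Grayscale image: (rows, cols), a single channel.
        height, width = shape
        channels = 1
    elif len(shape) == 3:
        # Colour image: (rows, cols, channels).
        height, width, channels = shape
    else:
        raise ValueError("expected a 2-D or 3-D image shape, got %r" % (shape,))
    # One row occupies width * channels bytes when the data is tightly packed.
    return width, height, channels, channels * width
```

The call above could then be written as tessapi.SetImageBytes(img.tobytes(), *set_image_bytes_args(img.shape)). Note that the shape is (rows, cols), so width comes from shape[1] and height from shape[0].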

Jan 15 '18 15:01 mpmX
