tesserocr
Is this expected? PIL+image_to_text and file_to_text give different results
I ran the sample code from the README on one of my images and, interestingly, it gives two different results depending on whether I load the image with PIL and call image_to_text or go directly through file_to_text. The PIL version seems to perform better, and the images are just regular JPEGs. The sample code and its output are below.
CODE
import tesserocr
from PIL import Image
print(tesserocr.tesseract_version())  # print tesseract-ocr version
print(tesserocr.get_languages())  # prints tessdata path and list of available languages
# l.blobname is defined elsewhere in my code and refers to the JPEG being tested
image = Image.open(l.blobname)
print(tesserocr.image_to_text(image))  # print ocr text from image
print("===================================================================")
# or
print(tesserocr.file_to_text(l.blobname))
OUTPUT
tesseract 3.04.01
leptonica-1.73
libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
(u'/usr/share/tesseract-ocr/tessdata/', [u'equ', u'eng', u'osd'])
Everyday
Low
Price
Hem n 796577
um pm 5»
murAflWuNNEHSM
===================================================================
1m mm umvtnm
s 1 7885333252;
Everyday
Low
PVICE
That's weird indeed. I can't really help unless you provide the image so that I can reproduce the issue. Here are some debugging tips I can think of:
- Try running the tesseract console command against your image:
tesseract <image-file> stdout
- Try viewing the thresholded images after loading the image into the tesseract API:
import tesserocr
from PIL import Image
api = tesserocr.PyTessBaseAPI()
image = Image.open(l.blobname)
# load PIL image
api.SetImage(image)
# view thresholded image
api.GetThresholdedImage().show()
# load image from file
api.SetImageFile(l.blobname)
# view thresholded image
api.GetThresholdedImage().show()
This should show you whether there are any visible differences between the images as processed by the API.
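If eyeballing the two thresholded images is inconclusive, the comparison can also be put into numbers. The following is only a rough sketch, assuming Python 3; differing_fraction and compare_thresholded are hypothetical helper names, not part of tesserocr. Both images are converted to 8-bit grayscale first so that the byte buffers line up one byte per pixel.

```python
def differing_fraction(a, b):
    """Return the fraction of byte positions at which two equal-length buffers differ."""
    if len(a) != len(b):
        raise ValueError("buffers differ in length: %d vs %d" % (len(a), len(b)))
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def compare_thresholded(path):
    """Threshold `path` via SetImage (PIL) and via SetImageFile, then compare the results."""
    # Imported here so that differing_fraction stays usable without these packages.
    import tesserocr
    from PIL import Image

    with tesserocr.PyTessBaseAPI() as api:
        # Route 1: image decoded by Pillow, then handed to tesseract
        api.SetImage(Image.open(path))
        via_pil = api.GetThresholdedImage().convert("L")
        # Route 2: image decoded by tesseract/leptonica directly from the file
        api.SetImageFile(path)
        via_file = api.GetThresholdedImage().convert("L")

    if via_pil.size != via_file.size:
        return None  # different dimensions, so a per-pixel comparison is meaningless
    return differing_fraction(via_pil.tobytes(), via_file.tobytes())
```

A result of 0.0 from compare_thresholded(path) would mean both loading routes hand tesseract the same binarized image; anything larger suggests the input already differs before recognition starts.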
If you'd like to share your image for reproducing the issue, please share your PIL and tesserocr versions as well.
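One way to collect those version numbers, sketched with the standard library only and assuming Python 3.8 or newer; installed_version is a hypothetical helper name, and "tesserocr" and "Pillow" are the distribution names as published on PyPI.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("tesserocr", "Pillow"):
    print(name, installed_version(name))
```

The tesseract and leptonica versions themselves are already printed by tesserocr.tesseract_version(), as in the original report.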
Might be related:
https://github.com/sirfz/tesserocr/issues/55#issuecomment-309237269
sirfz commented
Another thing to note is that Pillow seems to apply some kind of pre-processing to the image after loading it since the resulting image is not identical to the original. It might be worth bringing this up with the Pillow team.
It would be great if @sirfz could tell us how to use OpenCV instead of PIL.
@Link009 OpenCV can be used as follows. Instead of calling
tessapi.SetImage(img)
you can call
try:
    channels = img.shape[2]
except IndexError:
    channels = 1
tessapi.SetImageBytes(img.tobytes(), img.shape[1], img.shape[0], channels, channels*img.shape[1])
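The argument order is easy to get wrong, because NumPy shapes are (height, width[, channels]) while SetImageBytes takes width before height. As a small sketch, the calculation can be isolated in a helper; set_image_bytes_args is a hypothetical name, and it assumes a contiguous uint8 array, i.e. one byte per channel, which is what OpenCV's imread normally returns.

```python
def set_image_bytes_args(shape):
    """Derive (width, height, bytes_per_pixel, bytes_per_line) from a NumPy-style shape.

    Assumes a contiguous uint8 image, so each channel occupies exactly one byte.
    """
    height, width = shape[0], shape[1]
    # Grayscale arrays have no third dimension; treat them as single-channel
    channels = shape[2] if len(shape) > 2 else 1
    bytes_per_pixel = channels
    bytes_per_line = bytes_per_pixel * width
    return width, height, bytes_per_pixel, bytes_per_line


# Intended use with an OpenCV image `img` and a PyTessBaseAPI instance `tessapi`:
#   tessapi.SetImageBytes(img.tobytes(), *set_image_bytes_args(img.shape))
```

Note that OpenCV stores color images in BGR order rather than RGB. Whether that affects recognition for a particular image is not covered here, so it is worth comparing the output against the PIL route.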