tesserocr
Is it possible to use tesserocr without Pillow and use openCV instead?
I am looking for a good python wrapper for the tesseract api and feel like I have come to the right place :)
The only grain of salt I see as of now is that from the usage examples and from a first glance at the code, Pillow seems to be embedded quite deeply. Is it possible to use tesserocr also with opencv images (simply numpy arrays) as input to the tesseract api?
You can use `SetImageBytes` instead of `SetImage`. But there are some functions that have no raw alternative (e.g. `Get*Image`, `ProcessPage`). Tesseract internally uses Leptonica data structures for image data, so some form of conversion is inevitable.
Interfacing between OpenCV and PIL is very straightforward, though (and can be done losslessly).
Pillow is the nearest thing Python has to a standard imaging library. (And it has fewer dependencies than OpenCV.)
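For illustration, the lossless round trip between an OpenCV-style array and a PIL image mentioned above can be sketched like this (assuming numpy and Pillow are installed; the array contents here are made up):

```python
import numpy as np
from PIL import Image

# An OpenCV-style BGR array (what cv2.imread would return)
bgr = np.zeros((4, 6, 3), dtype=np.uint8)
bgr[..., 0] = 255  # fill the blue channel

# OpenCV -> PIL: reverse the channel axis (BGR -> RGB), then wrap the array
rgb = np.ascontiguousarray(bgr[:, :, ::-1])
pil_img = Image.fromarray(rgb)

# PIL -> OpenCV: back to a numpy array, reversing the channels again
back = np.asarray(pil_img)[:, :, ::-1]

assert (back == bgr).all()  # the round trip is lossless
```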
Hi,
I am trying to use tesserocr with numpy images (OpenCV) to get rid of PIL. I have succeeded when using color images but failed with grayscale ones.
Here is my code:
```python
w = np_image.shape[1]
h = np_image.shape[0]
if len(np_image.shape) > 2:
    np_image = cv2.cvtColor(np_image, cv2.COLOR_BGR2RGB)
    bpp = 3
else:
    bpp = 1
img_bytes = np_image.tobytes()
bpl = bpp * w
.....
tess_api.SetImageBytes(
    imagedata=img_bytes,
    width=w,
    height=h,
    bytes_per_pixel=bpp,
    bytes_per_line=bpl)
```
Any ideas?
The tesseract `SetImage` method expects an image buffer. The issue with cv2 is that it doesn't (AFAIK) provide the ability to convert an array to an image format (e.g. PNG), which is why you have to resort to other modules (such as PIL) to do that.
I think providing a method for passing an image buffer (let's call it `SetImageBuffer`) which expects that image as a byte string would allow more flexibility for users who would like to rely on a module other than PIL for setting images.
@mlorenzo-alice your code looks correct for grayscale AFAICT. The `len(np_image.shape) > 2` branch is not as convincing (because you can still have e.g. `RGBA`, `RGB`, `LA`, `L` channels), but I would expect the other one to work. What exactly does not work? (Have you tried extracting the raw or thresholded image with `GetImage` or `GetComponentImages` afterwards?)
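The point about the shape test being too coarse can be seen with plain numpy (a sketch; no tesserocr call involved):

```python
import numpy as np

gray = np.zeros((10, 20), dtype=np.uint8)     # 2d -> 1 byte per pixel
rgb = np.zeros((10, 20, 3), dtype=np.uint8)   # 3 channels
rgba = np.zeros((10, 20, 4), dtype=np.uint8)  # 4 channels

# len(shape) > 2 only says "there is a channel axis"; it cannot
# distinguish RGB from RGBA or LA, so hard-coding bpp = 3 would
# silently mis-describe an RGBA array.
for arr in (gray, rgb, rgba):
    bpp = 1 if arr.ndim == 2 else arr.shape[2]
    print(arr.shape, '->', bpp, 'byte(s) per pixel')
```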
@sirfz

> I think providing a method for passing an image buffer (let's call it `SetImageBuffer`) which expects that image as a byte string would allow more flexibility for users that would like to rely on a different module than PIL for setting images.

You mean another method besides the existing `SetImageBytes` illustrated above, which wraps the Tesseract API's raw-buffer `SetImage` function?
Yes, a method which accepts an image as a byte string and calls tesseract's `SetImage` with the pix image; or, if the above logic for calling `SetImageBytes` from an array is generic for all images, then we can just wrap it in a method (let's call it `SetImageArray`). Something like:
```cython
def SetImageArray(self, array, int bpp):
    cdef:
        cuchar_t *cimagedata = array.tobytes()
        int h = array.shape[0]
        int w = array.shape[1]
        int bpl = bpp * w
    self._baseapi.SetImage(cimagedata, w, h, bpp, bpl)
```
> Yes a method which accepts an image as a byte string ... if the above logic for calling `SetImageBytes` from an array is generic for all images then we can just wrap it in a method

But a byte string can be anything; you need additional information. `SetImageBytes` is already pretty generic IMO.
As for taking a Numpy array, yes, I think this could be valuable (and you don't need extra information like `bpp` in that case):
```cython
def SetImageArray(self, array):
    if not isinstance(array, numpy.ndarray):
        raise TypeError("SetImageArray requires a Numpy array")
    if array.ndim not in [2, 3]:
        raise TypeError("SetImageArray requires a 2d image or 2d with up to 4 color channels")
    if array.ndim == 2:
        bpp = 1
    else:
        bpp = array.shape[2]  # too coarse, see below
    cdef:
        int h = array.shape[0]
        int w = array.shape[1]
        int bpl = bpp * w
        cuchar_t *cimagedata = array.tobytes()
    self._baseapi.SetImage(cimagedata, w, h, bpp, bpl)
```
Not sure though how well this can cope with alpha channels (`RGBA` or `LA`) or palette/label data.
From Tesseract's thresholder it seems that bpp can only be 1 (for 1x8-bit `L`), 3 (for 3x8-bit `RGB`) or 4 (for 1x32-bit `I` or `F`), so users would have to ensure they convert `LA`, `RGBA`, `CMYK`, `YCbCr`, `LAB`, `HSV` arrays themselves before passing them over.
Other places in Tesseract actively try to get rid of alpha channels.
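A hypothetical pre-conversion helper along those lines might look like this (the function name and the alpha handling are illustrative only, not part of tesserocr; real code may want to composite alpha rather than drop it):

```python
import numpy as np

def to_supported_layout(array):
    """Reduce channel layouts Tesseract's thresholder rejects (e.g. 4x8-bit
    RGBA, 2x8-bit LA) to ones it accepts (1x8-bit L or 3x8-bit RGB).
    Alpha is simply dropped here."""
    if array.ndim == 2:
        return array  # already single-channel grayscale
    if array.shape[2] == 4:
        return np.ascontiguousarray(array[:, :, :3])  # RGBA -> RGB
    if array.shape[2] == 2:
        return np.ascontiguousarray(array[:, :, 0])   # LA -> L
    return array  # assume 3-channel RGB

rgba = np.zeros((5, 5, 4), dtype=np.uint8)
assert to_supported_layout(rgba).shape == (5, 5, 3)
```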
I think it's worth adding the `SetImageArray` method with proper documentation that includes the above (any extra information is appreciated here).
As for the byte string method, it's basically a lower-level version of `SetImage`, which converts a `PIL.Image` object into a byte string like this:
```cython
cdef bytes _image_buffer(image):
    """Return raw bytes of a PIL Image"""
    with BytesIO() as f:
        image.save(f, image.format or 'PNG')
        return f.getvalue()
```
and then calls `pixReadMem(buff, size)` to create the Pix object, which is passed to `SetImage`. In this case `pixReadMem` is figuring out the image format, I guess.
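For comparison, producing such a formatted byte string from a numpy array without touching the disk could look like this (assuming Pillow; the in-memory PNG carries its own format metadata, which is what lets `pixReadMem` detect the image type):

```python
from io import BytesIO

import numpy as np
from PIL import Image

arr = np.zeros((8, 8, 3), dtype=np.uint8)  # dummy RGB image

with BytesIO() as f:
    Image.fromarray(arr).save(f, 'PNG')
    buf = f.getvalue()

# The PNG signature is embedded in the stream itself
assert buf[:8] == b'\x89PNG\r\n\x1a\n'
```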
The `SetImageBuffer` method would look something like this:
```cython
def SetImageBuffer(self, cuchar_t *buffer):
    cdef size_t size = len(buffer)
    with nogil:
        self._destroy_pix()
        self._pix = pixReadMem(buffer, size)
        if self._pix == NULL:
            with gil:
                raise RuntimeError('Error reading image')
        self._baseapi.SetImage(self._pix)
```
That's just a copy of `SetImage` but without the call to `_image_buffer`.
Oh, now I got it. Yes, that would be another valuable alternative for users who can somehow manage to produce a formatted byte stream including metadata, but don't want to waste resources on disk I/O.
Maybe *stream* would be more appropriate for this than *buffer*, though. (The term *buffer* is usually used for raw, unformatted data of fixed length.)
You are right! Actually my only problem is when calling `GetComponentImages` to extract text bounding boxes.
This is working as expected:
```python
tess_api.SetImageBytes(
    imagedata=img_bytes,
    width=w,
    height=h,
    bytes_per_pixel=bpp,
    bytes_per_line=bpl)
tess_api.Recognize()
tess_api.GetUTF8Text()
```
This is NOT working; it throws `NameError: name 'Image' is not defined`:

```python
tess_api.SetImageBytes(
    imagedata=img_bytes,
    width=w,
    height=h,
    bytes_per_pixel=bpp,
    bytes_per_line=bpl)
boxes = tess_api.GetComponentImages(RIL.TEXTLINE, True)
```
That's because `GetComponentImages` returns the list of images as `PIL.Image` objects and it seems you don't have it installed.
IMHO `tesserocr` should have an `install_requires` on Pillow.
I know you want to be more tolerant/flexible: https://github.com/sirfz/tesserocr/blob/1ba079f89a340187612e32258e58c0b88fa987ab/tesserocr.pyx#L26-L30
And you have `try` and `finally` in all usage contexts.
But then `_pix_to_image` could catch `NameError` with a proper error message...
Yes, I didn't originally want to make Pillow mandatory since a lot of use cases can go ahead without any need for it at all. I agree though that in this case (and probably other cases too) the error message needs to be more descriptive.
I think for such methods that utilize Pillow, we can add extra parameters that allow users to switch off Pillow and return raw data instead. Or maybe a global switch to disable Pillow and return raw data only.
That would be great!
BTW, having a global `pil_installed = True` when the import succeeds, as in `tests/test_api.py`, would also be useful.
> Or maybe a global switch to disable Pillow and purely return raw data only.

You could also make Pillow an install-time extra:

```python
extras_require={'pillow': ['Pillow >= 7.1.2']}
```
Finally, how about also interfacing with `pixa` objects natively from Python via https://github.com/jsbueno/pyleptonica/pull/11?
I did contemplate also mapping some Leptonica functionality at the beginning (Pix being the main reason) but didn't really need it for my use case at the time. I'm tempted by your suggestion of using pyleptonica, but I'm not sure whether maintaining compatibility won't become an issue later, and there's also the lack of proper Windows support.
Yes, especially since pyleptonica itself does not seem to be maintained anymore.
But as we were already discussing the possibility to yield/accept raw data, I thought passing `PIX` memory objects through in either direction would be a good compromise. That way, pyleptonica users could have stronger Tesseract integration, but we would not force any new dependencies. (The situation is a bit different from Pillow "support", because `tesserocr` would not need to really do anything itself for Leptonica objects.)
Following up on @mlorenzo-alice's existing code, here's a function that allows for OpenCV colour conversion while also handling grayscale and binary images:
```python
def SetCVImage(self, image, color='BGR'):
    """Sets an OpenCV-style image for recognition.

    'image' is a numpy ndarray in color, grayscale, or binary (boolean)
    format.
    'color' is a string representing the current color of the image,
    for conversion using OpenCV into an RGB array image. By default
    color images in OpenCV use BGR, but any valid channel
    specification can be used (e.g. 'BGRA', 'XYZ', 'YCrCb', 'HSV', 'HLS',
    'Lab', 'Luv', 'BayerBG', 'BayerGB', 'BayerRG', 'BayerGR').
    Conversion only occurs if the third dimension of the array is
    not 1, else 'color' is ignored.
    """
    bytes_per_pixel = image.shape[2] if len(image.shape) == 3 else 1
    height, width = image.shape[:2]
    bytes_per_line = bytes_per_pixel * width
    if bytes_per_pixel != 1 and color != 'RGB':
        # non-RGB color image -> convert to RGB
        image = cv2.cvtColor(image, getattr(cv2, f'COLOR_{color}2RGB'))
    elif bytes_per_pixel == 1 and image.dtype == bool:
        # binary image -> convert to bitstream
        image = np.packbits(image, axis=1)
        bytes_per_line = image.shape[1]
        width = bytes_per_line * 8
        bytes_per_pixel = 0
    # else image already RGB or grayscale
    self.SetImageBytes(image.tobytes(), width, height,
                       bytes_per_pixel, bytes_per_line)
```
The docstring is in my personally preferred format, and I'm not sure how to integrate this with Cython because I haven't used it before, but other than that I believe it's PR-ready.
NOTE: requires

```python
import numpy as np
import cv2
```
EDIT: This also clears up #47
~EDIT2: modified to also pass through 32-bit (RGBA) images without requiring conversion to RGB~
EDIT3: restored to only allow RGB images to pass through without conversion, as @bertsky specified that the allowed 32-bit input is for `I` or `F` 1x32-bit images, not for 4x8-bit RGBA (also confirmed in their provided link to the tesseract thresholder code).
EDIT4,5: changed the second `if` to `elif` (non-overlapping cases); added else-clause comment; formatting
EDIT 6: for using this in the interim while it's not a part of the package, you can make a file like the following

```python
from tesserocr import PyTessBaseAPI

def SetCVImage(...):
    ...

PyTessBaseAPI.SetCVImage = SetCVImage
```

and then import `PyTessBaseAPI` from that file instead of directly from tesserocr. It's also possible to subclass `PyTessBaseAPI` to add the extra function, but the above approach uses less code and has the same result.
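As a side note on the binary branch in `SetCVImage` above, the geometry that `np.packbits` produces can be checked quickly in isolation (a standalone sketch):

```python
import numpy as np

binary = np.zeros((4, 20), dtype=bool)  # 4 rows of 20 binary pixels
packed = np.packbits(binary, axis=1)    # 8 pixels per byte, rows zero-padded

assert packed.shape == (4, 3)  # ceil(20 / 8) == 3 bytes per line
bytes_per_line = packed.shape[1]
width = bytes_per_line * 8     # 24: the reported width covers the padding bits
assert width == 24
```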
`SetImageBytes` is so close to working for numpy arrays; however, it requires an (unnecessary) copy to work (the `arr.tobytes()` on the numpy array).
It looks like `arr.data` returns a `memoryview`, which should work fine as a `char*` passed along to tesseract's API -- but it looks like the Cython is written a little too strictly (with `_b(...)`), preventing `memoryview` (and I guess `bytearray` too).
It looks like defining the parameter as `const unsigned char [::1]` would work, since this seems to accept bytes and memoryview equally: https://github.com/cython/cython/issues/3488#issuecomment-609039796
@asottile that sounds reasonable, feel free to contribute a PR if you have the time/will to do so
So I poked at this a bit and it's a little more complicated, unfortunately -- that change would allow it to accept memoryviews -- but the cv2 arrays are not necessarily C-contiguous, which is what tesseract expects.
Here's the start of that patch, but it doesn't actually get me all the way there, and making a C-contiguous array from a cv2 numpy array incurs a copy anyway, so it wouldn't even be a benefit:
```diff
$ git diff
diff --git a/tesserocr.pyx b/tesserocr.pyx
index 333e3ea..e5c21b4 100644
--- a/tesserocr.pyx
+++ b/tesserocr.pyx
@@ -1605,7 +1605,7 @@ cdef class PyTessBaseAPI:
         """
         self._baseapi.ClearAdaptiveClassifier()
 
-    def SetImageBytes(self, imagedata, int width, int height,
+    def SetImageBytes(self, const unsigned char[::1] imagedata, int width, int height,
                       int bytes_per_pixel, int bytes_per_line):
         """Provide an image for Tesseract to recognize.
 
@@ -1618,7 +1618,7 @@ cdef class PyTessBaseAPI:
         will automatically perform recognition.
 
         Args:
-            imagedata (str): Raw image bytes.
+            imagedata (bytes): Raw image bytes.
             width (int): image width.
             height (int): image height.
             bytes_per_pixel (int): bytes per pixel.
@@ -1630,12 +1630,9 @@ cdef class PyTessBaseAPI:
                 1 represents WHITE. For binary images set bytes_per_pixel=0.
             bytes_per_line (int): bytes per line.
         """
-        cdef:
-            bytes py_imagedata = _b(imagedata)
-            cuchar_t *cimagedata = py_imagedata
         with nogil:
             self._destroy_pix()
-            self._baseapi.SetImage(cimagedata, width, height, bytes_per_pixel, bytes_per_line)
+            self._baseapi.SetImage(&imagedata[0], width, height, bytes_per_pixel, bytes_per_line)
 
     def SetImageBytesBmp(self, imagedata):
         """Provide an image for Tesseract to recognize.
```
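The C-contiguity issue described above can be demonstrated without any Cython (a sketch with plain numpy):

```python
import numpy as np

arr = np.arange(12, dtype=np.uint8).reshape(3, 4)

# arr.data is a zero-copy memoryview over the array's buffer
mv = arr.data
assert isinstance(mv, memoryview)
assert mv.c_contiguous  # a freshly built array is C-contiguous

# a strided slice (as cv2 operations can produce) is not C-contiguous,
# so a `const unsigned char[::1]` parameter would reject its memoryview
strided = arr[:, ::2]
assert not strided.data.c_contiguous

# tobytes() always hands back a contiguous copy -- hence the extra copy
assert strided.tobytes() == bytes(np.ascontiguousarray(strided).data)
```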