tesserocr
Is it possible to use tesserocr without Pillow and use openCV instead?
I am looking for a good python wrapper for the tesseract api and feel like I have come to the right place :)
The only grain of salt I see as of now is that from the usage examples and from a first glance at the code, Pillow seems to be embedded quite deeply. Is it possible to use tesserocr also with opencv images (simply numpy arrays) as input to the tesseract api?
You can use `SetImageBytes` instead of `SetImage`. But there are some functions that have no raw alternative (e.g. `Get*Image`, `ProcessPage`). Tesseract internally uses Leptonica data structures for image data, so some form of conversion is inevitable.
Interfacing between OpenCV and PIL is very straightforward, though (and can be done losslessly).
Pillow is the nearest thing Python has to a standard imaging library. (And it has fewer dependencies than OpenCV.)
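For illustration, the lossless round trip between an OpenCV-style array and a PIL image mentioned above can be sketched like this (assuming numpy and Pillow are installed; the array contents here are made up):

```python
import numpy as np
from PIL import Image

# An OpenCV-style BGR array (what cv2.imread would return)
bgr = np.zeros((4, 6, 3), dtype=np.uint8)
bgr[..., 0] = 255  # fill the blue channel

# OpenCV -> PIL: reverse the channel axis (BGR -> RGB), then wrap the array
rgb = np.ascontiguousarray(bgr[:, :, ::-1])
pil_img = Image.fromarray(rgb)

# PIL -> OpenCV: back to a numpy array, reversing the channels again
back = np.asarray(pil_img)[:, :, ::-1]

assert (back == bgr).all()  # the round trip is lossless
```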
Hi,
I am trying to use tesserocr with numpy images (OpenCV) to get rid of PIL. I have succeeded when using color images but failed with grayscale ones.
Here is my code:
```python
w = np_image.shape[1]
h = np_image.shape[0]
if len(np_image.shape) > 2:
    np_image = cv2.cvtColor(np_image, cv2.COLOR_BGR2RGB)
    bpp = 3
else:
    bpp = 1
img_bytes = np_image.tobytes()
bpl = bpp * w
.....
tess_api.SetImageBytes(
    imagedata=img_bytes,
    width=w,
    height=h,
    bytes_per_pixel=bpp,
    bytes_per_line=bpl)
```
Any ideas?
The tesseract `SetImage` method expects an image buffer. The issue with cv2 is that it doesn't (AFAIK) provide the ability to convert an array to an image format (e.g. PNG), which is why you have to resort to other modules (such as PIL) to do that.
I think providing a method for passing an image buffer (let's call it `SetImageBuffer`) which expects that image as a byte string would allow more flexibility for users who would like to rely on a module other than PIL for setting images.
@mlorenzo-alice your code looks correct for grayscale AFAICT. The `len(np_image.shape) > 2` branch is not as convincing (because you can still have e.g. `RGBA`, `RGB`, `LA`, `L` channels), but I would expect the other one to work. What exactly does not work? (Have you tried extracting the raw or thresholded image with `GetImage` or `GetComponentImages` afterwards?)
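The point about the shape test being too coarse can be seen with plain numpy (a sketch; no tesserocr call involved):

```python
import numpy as np

gray = np.zeros((10, 20), dtype=np.uint8)     # 2d -> 1 byte per pixel
rgb = np.zeros((10, 20, 3), dtype=np.uint8)   # 3 channels
rgba = np.zeros((10, 20, 4), dtype=np.uint8)  # 4 channels

# len(shape) > 2 only says "there is a channel axis"; it cannot
# distinguish RGB from RGBA or LA, so hard-coding bpp = 3 would
# silently mis-describe an RGBA array.
for arr in (gray, rgb, rgba):
    bpp = 1 if arr.ndim == 2 else arr.shape[2]
    print(arr.shape, '->', bpp, 'byte(s) per pixel')
```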
@sirfz

> I think providing a method for passing an image buffer (let's call it `SetImageBuffer`) which expects that image as a byte string would allow more flexibility for users that would like to rely on a different module than PIL for setting images.

You mean another method besides the existing `SetImageBytes` illustrated above, which wraps the Tesseract API's raw-buffer `SetImage` function?
Yes, a method which accepts an image as a byte string and calls tesseract's `SetImage` with the pix image; or, if the above logic for calling `SetImageBytes` from an array is generic for all images, then we can just wrap it in a method (let's call it `SetImageArray`). Something like:
```cython
def SetImageArray(self, array, int bpp):
    cdef:
        cuchar_t *cimagedata = array.tobytes()
        int h = array.shape[0]
        int w = array.shape[1]
        int bpl = bpp * w
    self._baseapi.SetImage(cimagedata, w, h, bpp, bpl)
```
> Yes a method which accepts an image as a byte string ... if the above logic for calling `SetImageBytes` from an array is generic for all images then we can just wrap it in a method

But a byte string can be anything; you need additional information. `SetImageBytes` is already pretty generic IMO.
As for taking a Numpy array, yes, I think this could be valuable (and you don't need extra information like `bpp` in that case):
```cython
def SetImageArray(self, array):
    if not isinstance(array, numpy.ndarray):
        raise TypeError("SetImageArray requires a Numpy array")
    if array.ndim not in [2, 3]:
        raise TypeError("SetImageArray requires a 2d image or 2d with up to 4 color channels")
    if array.ndim == 2:
        bpp = 1
    else:
        bpp = array.shape[2]  # too coarse, see below
    cdef:
        int h = array.shape[0]
        int w = array.shape[1]
        int bpl = bpp * w
        cuchar_t *cimagedata = array.tobytes()
    self._baseapi.SetImage(cimagedata, w, h, bpp, bpl)
```
Not sure though how well this can cope with alpha channels (`RGBA` or `LA`) or palette/label data.
From Tesseract's thresholder it seems that bpp can only be 1 (for 1x8-bit `L`), 3 (for 3x8-bit `RGB`) or 4 (for 1x32-bit `I` or `F`), so users would have to ensure they convert `LA`, `RGBA`, `CMYK`, `YCbCr`, `LAB`, `HSV` arrays themselves before passing them over.
Other places in Tesseract actively try to get rid of alpha channels.
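A hypothetical pre-conversion helper along those lines might look like this (the function name and the alpha handling are illustrative only, not part of tesserocr; real code may want to composite alpha rather than drop it):

```python
import numpy as np

def to_supported_layout(array):
    """Reduce channel layouts Tesseract's thresholder rejects (e.g. 4x8-bit
    RGBA, 2x8-bit LA) to ones it accepts (1x8-bit L or 3x8-bit RGB).
    Alpha is simply dropped here."""
    if array.ndim == 2:
        return array  # already single-channel grayscale
    if array.shape[2] == 4:
        return np.ascontiguousarray(array[:, :, :3])  # RGBA -> RGB
    if array.shape[2] == 2:
        return np.ascontiguousarray(array[:, :, 0])   # LA -> L
    return array  # assume 3-channel RGB

rgba = np.zeros((5, 5, 4), dtype=np.uint8)
assert to_supported_layout(rgba).shape == (5, 5, 3)
```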
I think it's worth adding the `SetImageArray` method with proper documentation that includes the above (any extra information is appreciated here).
As for the byte string method, it's basically a lower-level version of `SetImage`, which converts a `PIL.Image` object into a byte string like this:
```cython
cdef bytes _image_buffer(image):
    """Return raw bytes of a PIL Image"""
    with BytesIO() as f:
        image.save(f, image.format or 'PNG')
        return f.getvalue()
```
and then calls `pixReadMem(buff, size)` to create the Pix object, which is passed to `SetImage`. In this case `pixReadMem` is figuring out the image format, I guess.
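For comparison, producing such a formatted byte string from a numpy array without touching the disk could look like this (assuming Pillow; the in-memory PNG carries its own format metadata, which is what lets `pixReadMem` detect the image type):

```python
from io import BytesIO

import numpy as np
from PIL import Image

arr = np.zeros((8, 8, 3), dtype=np.uint8)  # dummy RGB image

with BytesIO() as f:
    Image.fromarray(arr).save(f, 'PNG')
    buf = f.getvalue()

# The PNG signature is embedded in the stream itself
assert buf[:8] == b'\x89PNG\r\n\x1a\n'
```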
The `SetImageBuffer` method would look something like this:
```cython
def SetImageBuffer(self, cuchar_t *buffer):
    cdef size_t size = len(buffer)
    with nogil:
        self._destroy_pix()
        self._pix = pixReadMem(buffer, size)
        if self._pix == NULL:
            with gil:
                raise RuntimeError('Error reading image')
        self._baseapi.SetImage(self._pix)
```
That's just a copy of `SetImage` but without the call to `_image_buffer`.
Oh, now I got it. Yes, that would be another valuable alternative for users who can somehow manage to produce a formatted byte stream including metadata, but don't want to waste resources on disk I/O.
Maybe *stream* would be more appropriate for this than *buffer*, though. (The term *buffer* is usually used for raw, unformatted data of fixed length.)
You are right! Actually my only problem is when calling `GetComponentImages` to extract text bounding boxes.
This is working as expected:
```python
tess_api.SetImageBytes(
    imagedata=img_bytes,
    width=w,
    height=h,
    bytes_per_pixel=bpp,
    bytes_per_line=bpl)
tess_api.Recognize()
tess_api.GetUTF8Text()
```
This is NOT working; it throws `NameError: name 'Image' is not defined`:

```python
tess_api.SetImageBytes(
    imagedata=img_bytes,
    width=w,
    height=h,
    bytes_per_pixel=bpp,
    bytes_per_line=bpl)
boxes = tess_api.GetComponentImages(RIL.TEXTLINE, True)
```
That's because `GetComponentImages` returns the list of images as `PIL.Image` objects and it seems you don't have it installed.
IMHO `tesserocr` should have an `install_requires` on Pillow.
I know you want to be more tolerant/flexible: https://github.com/sirfz/tesserocr/blob/1ba079f89a340187612e32258e58c0b88fa987ab/tesserocr.pyx#L26-L30
And you have `try` and `finally` in all usage contexts.
But then `_pix_to_image` could catch `NameError` with a proper error message...
Yes, I didn't originally want to make Pillow mandatory since a lot of use cases can go ahead without any need for it at all. I agree though that in this case (and probably other cases too) the error message needs to be more descriptive.
I think for such methods that utilize Pillow, we can add extra parameters that allow users to switch off Pillow and return raw data instead. Or maybe a global switch to disable Pillow and return raw data only.
That would be great!
BTW, having a global `pil_installed = True` when the import succeeds, as in `tests/test_api.py`, would also be useful.
> Or maybe a global switch to disable Pillow and purely return raw data only.

You could also make Pillow an install-time extra:

```python
extras_require={'pillow': ['Pillow >= 7.1.2']}
```
Finally, how about also interfacing with `pixa` objects natively from Python via https://github.com/jsbueno/pyleptonica/pull/11?
I did contemplate also mapping some Leptonica functionality at the beginning (Pix being the main reason) but didn't really need it for my use case at the time. I'm tempted by your suggestion of using pyleptonica, but I'm not sure whether maintaining compatibility won't become an issue later, and there's also the lack of proper Windows support.
Yes, especially since pyleptonica itself does not seem to be maintained anymore.
But as we were already discussing the possibility to yield/accept raw data, I thought passing `PIX` memory objects through in either direction would be a good compromise. That way, pyleptonica users could have stronger Tesseract integration, but we would not force any new dependencies. (The situation is a bit different from Pillow "support", because `tesserocr` would not need to really do anything itself for Leptonica objects.)
Following up on @mlorenzo-alice's existing code, here's a function that allows for OpenCV colour conversion while also handling grayscale and binary images:
```python
def SetCVImage(self, image, color='BGR'):
    """Sets an OpenCV-style image for recognition.

    'image' is a numpy ndarray in color, grayscale, or binary (boolean)
    format.
    'color' is a string representing the current color of the image,
    for conversion using OpenCV into an RGB array image. By default
    color images in OpenCV use BGR, but any valid channel
    specification can be used (e.g. 'BGRA', 'XYZ', 'YCrCb', 'HSV', 'HLS',
    'Lab', 'Luv', 'BayerBG', 'BayerGB', 'BayerRG', 'BayerGR').
    Conversion only occurs if the third dimension of the array is
    not 1, else 'color' is ignored.
    """
    bytes_per_pixel = image.shape[2] if len(image.shape) == 3 else 1
    height, width = image.shape[:2]
    bytes_per_line = bytes_per_pixel * width
    if bytes_per_pixel != 1 and color != 'RGB':
        # non-RGB color image -> convert to RGB
        image = cv2.cvtColor(image, getattr(cv2, f'COLOR_{color}2RGB'))
    elif bytes_per_pixel == 1 and image.dtype == bool:
        # binary image -> convert to bitstream
        image = np.packbits(image, axis=1)
        bytes_per_line = image.shape[1]
        width = bytes_per_line * 8
        bytes_per_pixel = 0
    # else image already RGB or grayscale
    self.SetImageBytes(image.tobytes(), width, height,
                       bytes_per_pixel, bytes_per_line)
```
The docstring is in my personally preferred format, and I'm not sure how to integrate this with Cython because I haven't used it before, but other than that I believe it's PR-ready.
NOTE: requires

```python
import numpy as np
import cv2
```
EDIT: This also clears up #47
~EDIT2: modified to also pass through 32-bit (RGBA) images without requiring conversion to RGB~
EDIT3: restored to only allow RGB images to pass through without conversion, as @bertsky specified that the allowed 32-bit input is for `I` or `F` 1x32-bit images, not for 4x8-bit RGBA (also confirmed in their provided link to the tesseract thresholder code).
EDIT4,5: changed the second `if` to `elif` (non-overlapping cases); added else-clause comment; formatting
EDIT 6: for using this in the interim while it's not a part of the package, you can make a file like the following

```python
from tesserocr import PyTessBaseAPI

def SetCVImage(...):
    ...

PyTessBaseAPI.SetCVImage = SetCVImage
```

and then import `PyTessBaseAPI` from that file instead of directly from tesserocr. It's also possible to subclass `PyTessBaseAPI` to add the extra function, but the above approach uses less code and has the same result.
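As a side note on the binary branch in `SetCVImage` above, the geometry that `np.packbits` produces can be checked quickly in isolation (a standalone sketch):

```python
import numpy as np

binary = np.zeros((4, 20), dtype=bool)  # 4 rows of 20 binary pixels
packed = np.packbits(binary, axis=1)    # 8 pixels per byte, rows zero-padded

assert packed.shape == (4, 3)  # ceil(20 / 8) == 3 bytes per line
bytes_per_line = packed.shape[1]
width = bytes_per_line * 8     # 24: the reported width covers the padding bits
assert width == 24
```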
`SetImageBytes` is so close to working for numpy arrays; however, it requires an (unnecessary) copy to work (the `arr.tobytes()` on the numpy array).
It looks like `arr.data` returns a `memoryview`, which should work fine as a `char*` passed along to tesseract's API -- but it looks like the Cython is written a little too strictly (with `_b(...)`), preventing `memoryview` (and I guess `bytearray` too).
It looks like defining the parameter as `const unsigned char [::1]` would work, since this seems to accept bytes and memoryview equally: https://github.com/cython/cython/issues/3488#issuecomment-609039796
@asottile that sounds reasonable, feel free to contribute a PR if you have the time/will to do so
So I poked at this a bit and it's a little more complicated, unfortunately -- that change would allow it to accept memoryviews -- but the cv2 arrays are not necessarily C-contiguous, which is what tesseract expects.
Here's the start of that patch, but it doesn't actually get me all the way there, and making a C-contiguous array from a cv2 numpy array incurs a copy anyway, so it wouldn't even be a benefit:
```diff
$ git diff
diff --git a/tesserocr.pyx b/tesserocr.pyx
index 333e3ea..e5c21b4 100644
--- a/tesserocr.pyx
+++ b/tesserocr.pyx
@@ -1605,7 +1605,7 @@ cdef class PyTessBaseAPI:
         """
         self._baseapi.ClearAdaptiveClassifier()
 
-    def SetImageBytes(self, imagedata, int width, int height,
+    def SetImageBytes(self, const unsigned char[::1] imagedata, int width, int height,
                       int bytes_per_pixel, int bytes_per_line):
         """Provide an image for Tesseract to recognize.
 
@@ -1618,7 +1618,7 @@ cdef class PyTessBaseAPI:
         will automatically perform recognition.
 
         Args:
-            imagedata (str): Raw image bytes.
+            imagedata (bytes): Raw image bytes.
             width (int): image width.
             height (int): image height.
             bytes_per_pixel (int): bytes per pixel.
@@ -1630,12 +1630,9 @@ cdef class PyTessBaseAPI:
                 1 represents WHITE. For binary images set bytes_per_pixel=0.
             bytes_per_line (int): bytes per line.
         """
-        cdef:
-            bytes py_imagedata = _b(imagedata)
-            cuchar_t *cimagedata = py_imagedata
         with nogil:
             self._destroy_pix()
-            self._baseapi.SetImage(cimagedata, width, height, bytes_per_pixel, bytes_per_line)
+            self._baseapi.SetImage(&imagedata[0], width, height, bytes_per_pixel, bytes_per_line)
 
     def SetImageBytesBmp(self, imagedata):
         """Provide an image for Tesseract to recognize.
```
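The C-contiguity issue described above can be demonstrated without any Cython (a sketch with plain numpy):

```python
import numpy as np

arr = np.arange(12, dtype=np.uint8).reshape(3, 4)

# arr.data is a zero-copy memoryview over the array's buffer
mv = arr.data
assert isinstance(mv, memoryview)
assert mv.c_contiguous  # a freshly built array is C-contiguous

# a strided slice (as cv2 operations can produce) is not C-contiguous,
# so a `const unsigned char[::1]` parameter would reject its memoryview
strided = arr[:, ::2]
assert not strided.data.c_contiguous

# tobytes() always hands back a contiguous copy -- hence the extra copy
assert strided.tobytes() == bytes(np.ascontiguousarray(strided).data)
```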