pytesseract Research the option of using stdin/stdout instead of saving the image on disk

Hi, I am wondering why you don't use the stdin argument to send the image to tesseract instead of saving it to disk? https://github.com/madmaze/pytesseract/blob/25a9d38649f6d9f907f9c6750cab03d699d0b340/src/pytesseract.py#L208

Jan 02 '19 06:01 cgallay

Hi @cgallay, short answer: that was the initial implementation. I agree that it's not the optimal solution, and maybe it should be used only for debugging purposes.

This question is also relevant for the stdout.

Jan 02 '19 18:01 bozhodimitrov

I found some issues with tesseract's stdin/stdout handling; some modes and versions are affected. For reference: tesseract-ocr/tesseract#785, tesseract-ocr/tesseract#85, etc.

Jan 04 '19 15:01 bozhodimitrov

It seems that the problems with the stdin are fixed in tesseract 5.0, but testing is needed.

Jul 26 '19 11:07 bozhodimitrov

Hi, I wanted to ask whether this feature has been added in the latest release of pytesseract. Saving the image to disk for the purpose of running the tesseract command is quite slow as of now (I think that is what pytesseract is doing right now). Can I work on this feature if it's not implemented? (I will need help from you to implement it.)

Oct 30 '19 05:10 AyushP123

At the moment pytesseract supports passing the path of the image itself, which skips creating temp files on disk. You can also pass the path to a text file containing a list of images for batch processing; this also skips the pytesseract temp files. Both of those options are documented.
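For reference, a minimal sketch of those two options. The file names are placeholders and `write_batch_list` / `ocr_without_temp_images` are hypothetical helpers, not part of pytesseract; the second one assumes pytesseract and a tesseract binary are installed when it is called.

```python
from pathlib import Path


def write_batch_list(image_paths, list_path="images.txt"):
    """Write one image path per line, the list format used for batch input."""
    list_file = Path(list_path)
    list_file.write_text(
        "\n".join(str(p) for p in image_paths) + "\n", encoding="utf-8"
    )
    return str(list_file)


def ocr_without_temp_images(single_image_path, image_paths):
    # Imported lazily so write_batch_list works without pytesseract installed.
    import pytesseract

    # A str argument is treated as a path, so no temp image is written.
    single_text = pytesseract.image_to_string(single_image_path)

    # A text file listing several images is processed as one batch.
    batch_text = pytesseract.image_to_string(write_batch_list(image_paths))
    return single_text, batch_text
```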

About the feature itself: we first need to test tesseract's stdin support on the most recent versions to be sure that it works correctly with pytesseract.

Oct 30 '19 08:10 bozhodimitrov

I just fiddled around with it, and this does the trick, at least for tesseract 4.1.1 and 5.0.0-alpha-20201231-243-gff83 (I'm on Windows 10 with both tesseract versions in MSYS2). It does not really honor the interface yet, because it does not give you any output files even if an extension is given, but I was just trying to get tesseract with stdin to work.

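import shlex
import subprocess
import sys
from errno import ENOENT

# tesseract_cmd, DEFAULT_ENCODING, get_errors, TesseractError and
# TesseractNotFoundError are the existing module-level names in pytesseract.py.

def run_and_get_output(
    image,
    extension='',
    lang=None,
    config='',
    nice=0,
    timeout=0,
    return_bytes=False,
):
    cmd_args = []

    # 'nice' has to precede the tesseract executable to take effect
    if not sys.platform.startswith('win32') and nice != 0:
        cmd_args += ('nice', '-n', str(nice))

    # 'stdin' and 'stdout' stand in for the input file and the output base
    cmd_args += (tesseract_cmd, 'stdin', 'stdout')

    if lang is not None:
        cmd_args += ('-l', lang)

    if config:
        cmd_args += shlex.split(config)

    try:
        proc = subprocess.Popen(cmd_args, stdout=subprocess.PIPE, stdin=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError as e:
        if e.errno != ENOENT:
            raise
        raise TesseractNotFoundError()

    # write the encoded image to the pipe, then collect both output streams
    image.save(proc.stdin, 'PNG')
    try:
        stdout_data, stderr_data = proc.communicate(timeout=timeout or None)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.communicate()
        raise RuntimeError('Tesseract process timeout')

    # check the exit code before returning, otherwise errors go unnoticed
    if proc.returncode:
        raise TesseractError(proc.returncode, get_errors(stderr_data))

    return stdout_data.decode(DEFAULT_ENCODING)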

Mar 11 '21 07:03 j-hap

Nice, you can always monkey-patch the module-level function and make it work for you. One problem here is that this functionality might be limited to the newer versions of tesseract, and we should consider the older 3.x versions too.
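A minimal sketch of such a monkey patch, written as a context manager so the original function is restored afterwards. `patched` is a hypothetical helper, and the usage in the comment assumes the function lives at `pytesseract.pytesseract.run_and_get_output` and that `stdin_run_and_get_output` is your own replacement, such as the function posted above.

```python
from contextlib import contextmanager


@contextmanager
def patched(module, name, replacement):
    """Temporarily replace module.<name>, restoring the original on exit."""
    original = getattr(module, name)
    setattr(module, name, replacement)
    try:
        yield original
    finally:
        # Restore even if the body raised, so the patch never leaks.
        setattr(module, name, original)


# Intended use (assumption: pytesseract is installed):
#
#     import pytesseract
#     with patched(pytesseract.pytesseract, "run_and_get_output",
#                  stdin_run_and_get_output):
#         text = pytesseract.image_to_string(pil_image)
```

Scoping the patch this way keeps the stock behavior available for everything outside the `with` block.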

Mar 11 '21 10:03 bozhodimitrov

I just compiled version 3.05.02 from https://github.com/tesseract-ocr/tesseract/tree/3.05.02 and ran the same modification as above with no problems, using traineddata from https://github.com/tesseract-ocr/tessdata/tree/3.04.00. I know it's just a sample and not a full-fledged test, but maybe that helps.

Mar 12 '21 07:03 j-hap

Yes, this is very helpful. This means that this feature can be added to pytesseract. But we will need to add additional tests in order to be sure that the other functionality doesn't break. By other functionality, I mean the option to pass image paths as raw strings to pytesseract functions. Also, I am not sure if tesseract will honor the configuration options in combination with the stdin input.
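One way to keep both code paths testable is to separate command construction from process handling. The following is only a sketch under that assumption: `build_tesseract_args` is a hypothetical helper, not existing pytesseract code, and whether every config option behaves the same with stdin input is exactly what the tests would still have to confirm.

```python
import shlex


def build_tesseract_args(image, tesseract_cmd="tesseract", lang=None, config=""):
    """Return (args, use_stdin): paths go on the command line, the rest via stdin."""
    if isinstance(image, str):
        # Raw string: treat it as a path and let tesseract read the file itself.
        args, use_stdin = [tesseract_cmd, image, "stdout"], False
    else:
        # Image object or bytes: the caller pipes the encoded image to stdin.
        args, use_stdin = [tesseract_cmd, "stdin", "stdout"], True

    if lang is not None:
        args += ["-l", lang]
    if config:
        # Config options are appended unchanged in both modes.
        args += shlex.split(config)
    return args, use_stdin
```

Because the function only builds a list, the path mode and the stdin mode can be compared in unit tests without running tesseract at all.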

And finally, based on the above - we should decide if this will be the default implementation (and if we should keep the old one around or not).

Mar 12 '21 11:03 bozhodimitrov

I just wrote a CLI wrapper for my own use, which uses stdin and stdout to communicate with the tesseract executable. It also converts the TSV output into a more intuitive dict structure. The wrapper object can be fed an image in different forms: image file path, image bytes, PIL image object, or ndarray. No temp file is created for input or output. https://github.com/mo-han/mo-han-toolbox/blob/master/mylib/wrapper/tesseract_ocr.py

from mylib.wrapper import tesseract_ocr
from PIL import Image

t=tesseract_ocr.TesseractOCRCLIWrapper(r'C:\Users\mo-han\AppData\Local\Programs\Tesseract-OCR\tesseract.exe')
t.set_language('chi_sim', 'eng').set_image_object(Image.open('r:1.jpg')).get_ocr_tsv_to_dict(psm=3, min_confidence=0.8)

[{'text': '名',
  'confidence': 0.91,
  'box': ((320, 10), (333, 10), (333, 23), (320, 23)),
  'page block paragraph line word level': (1, 1, 1, 1, 1, 5)},
 {'text': '称',
  'confidence': 0.87,
  'box': ((320, 28), (333, 28), (333, 40), (320, 40)),
  'page block paragraph line word level': (1, 1, 1, 1, 2, 5)},
 {'text': '修改',
  'confidence': 0.9,
  'box': ((320, 336), (333, 336), (333, 366), (320, 366)),
  'page block paragraph line word level': (1, 1, 1, 1, 3, 5)},
 {'text': '日',
  'confidence': 0.96,
  'box': ((320, 372), (333, 372), (333, 379), (320, 379)),
  'page block paragraph line word level': (1, 1, 1, 1, 4, 5)},
 {'text': '期',
  'confidence': 0.96,
  'box': ((320, 387), (333, 387), (333, 395), (320, 395)),
  'page block paragraph line word level': (1, 1, 1, 1, 5, 5)},
 {'text': ' ',
  'confidence': 0.95,
  'box': ((0, 195), (0, 195), (0, 224), (0, 224)),
  'page block paragraph line word level': (1, 2, 1, 1, 1, 5)},
 {'text': 'label_cn.txt',
  'confidence': 0.8,
  'box': ((283, 38), (295, 38), (295, 120), (283, 120)),
  'page block paragraph line word level': (1, 3, 1, 1, 1, 5)},
 {'text': '2019/8/5',
  'confidence': 0.87,
  'box': ((281, 337), (294, 337), (294, 401), (281, 401)),
  'page block paragraph line word level': (1, 3, 1, 1, 2, 5)},
 {'text': '15:24',
  'confidence': 0.95,
  'box': ((283, 408), (294, 408), (294, 446), (283, 446)),
  'page block paragraph line word level': (1, 3, 1, 1, 3, 5)},
 {'text': '2019/8/5',
  'confidence': 0.91,
  'box': ((254, 337), (267, 337), (267, 401), (254, 401)),
  'page block paragraph line word level': (1, 3, 1, 2, 3, 5)},
 {'text': '15:24',
  'confidence': 0.96,
  'box': ((256, 408), (267, 408), (267, 446), (256, 446)),
  'page block paragraph line word level': (1, 3, 1, 2, 4, 5)},
 {'text': '2019/8/5',
  'confidence': 0.89,
  'box': ((227, 337), (240, 337), (240, 401), (227, 401)),
  'page block paragraph line word level': (1, 3, 1, 3, 3, 5)},
 {'text': '15:24',
  'confidence': 0.96,
  'box': ((229, 408), (240, 408), (240, 446), (229, 446)),
  'page block paragraph line word level': (1, 3, 1, 3, 4, 5)}]
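
For anyone who only needs the TSV-to-dict part, here is a simplified sketch. It is not the wrapper's actual code: `tsv_to_words` is a hypothetical function, it relies on the column names tesseract writes in TSV mode (`left`, `top`, `width`, `height`, `conf`, `text`), and it reports the box as (left, top, right, bottom) rather than four corner points.

```python
import csv
import io


def tsv_to_words(tsv_text, min_confidence=0.0):
    """Convert tesseract TSV output into a list of word dicts."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        conf = float(row["conf"])
        # Non-word rows (page, block, paragraph, line) carry conf == -1.
        if conf < 0 or conf / 100 < min_confidence:
            continue
        left, top = int(row["left"]), int(row["top"])
        width, height = int(row["width"]), int(row["height"])
        words.append({
            "text": row["text"] or "",
            "confidence": conf / 100,  # scale 0..100 down to 0..1
            "box": (left, top, left + width, top + height),
        })
    return words
```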

Apr 01 '21 04:04 mo-han

Hi there. Having this implemented would be very useful: another dev and I are trying to read frames from a video, and a 400 ms processing time per frame at 30 fps leads to very long processing times. I have an OpenCV image in Python and would like to pass it directly into Tesseract instead of having it saved to disk.

Jun 04 '21 22:06 GreenCobalt

@GreenCobalt you could try my example code above

Jun 04 '21 23:06 mo-han

@GreenCobalt you can try https://github.com/sirfz/tesserocr for your use case, but I am not sure what underlying version of the tesseract implementation is used.

Jun 05 '21 15:06 bozhodimitrov

Wow! The snippet above really works, but it does not work directly for OpenCV users. They have to convert their image into a PIL Image first.
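
A minimal sketch of that conversion, assuming the frame is a uint8 ndarray in OpenCV's usual BGR (or BGRA, or grayscale) layout. `opencv_to_pil` is a hypothetical helper; it only needs NumPy and Pillow, so it does the channel swap by slicing instead of calling into cv2.

```python
import numpy as np
from PIL import Image


def opencv_to_pil(frame):
    """Convert an OpenCV-style ndarray (BGR, BGRA or grayscale) to a PIL Image."""
    if frame.ndim == 2:
        # Single-channel (grayscale) frames need no channel reordering.
        return Image.fromarray(frame)
    if frame.shape[2] == 4:
        # BGRA -> RGBA: swap blue and red, keep alpha last.
        return Image.fromarray(np.ascontiguousarray(frame[:, :, [2, 1, 0, 3]]))
    # BGR -> RGB by reversing the channel axis.
    return Image.fromarray(np.ascontiguousarray(frame[:, :, ::-1]))
```

The returned image can then be passed wherever a PIL Image is expected, for example as the `image` argument of the stdin-based function above.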

Aug 29 '22 08:08 dilerbatu