
Research the option of using stdin/stdout instead of saving the image on disk

Open cgallay opened this issue 6 years ago • 14 comments

Hi, I am wondering why you don't use the stdin argument to send the image to tesseract instead of saving it on disk. https://github.com/madmaze/pytesseract/blob/25a9d38649f6d9f907f9c6750cab03d699d0b340/src/pytesseract.py#L208

cgallay avatar Jan 02 '19 06:01 cgallay

Hi @cgallay, short answer: that was the initial implementation. I agree that it's not the optimal solution, and maybe it should be used only for debugging purposes.

This question is also relevant for the stdout.

bozhodimitrov avatar Jan 02 '19 18:01 bozhodimitrov

I found some issues with tesseract's stdin/stdout handling; some modes and versions are affected. For reference: tesseract-ocr/tesseract#785, tesseract-ocr/tesseract#85, etc.

bozhodimitrov avatar Jan 04 '19 15:01 bozhodimitrov

It seems that the problems with the stdin are fixed in tesseract 5.0, but testing is needed.

bozhodimitrov avatar Jul 26 '19 11:07 bozhodimitrov

Hi, I wanted to ask whether this feature has been added in the latest release of pytesseract. Saving the image to disk just to run the tesseract command is quite slow (I think that is what pytesseract is doing right now). Can I work on this feature if it's not implemented yet? I will need help from you to implement it.

AyushP123 avatar Oct 30 '19 05:10 AyushP123

At the moment pytesseract supports passing the path of the image itself, which skips creating temp image files on disk. You can also pass the path to a text file with a list of images for batch processing; this also skips the pytesseract temp files. Both of those options are documented.
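A minimal sketch of the two documented options above, assuming pytesseract and the tesseract binary are installed. The helper names (`write_batch_list`, `ocr_paths`) and the `images.txt` file name are made up for illustration and are not part of pytesseract.

```python
from pathlib import Path


def write_batch_list(image_paths, list_path):
    """Write one image path per line, the list format used for batch runs."""
    list_file = Path(list_path)
    list_file.write_text("".join(f"{p}\n" for p in image_paths), encoding="utf-8")
    return list_file


def ocr_paths(image_paths, list_path="images.txt"):
    """OCR several images in one call by passing a list file instead of images."""
    import pytesseract  # imported lazily so the list helper works without it

    list_file = write_batch_list(image_paths, list_path)
    # A str argument is treated as a path, so the images are not re-saved.
    return pytesseract.image_to_string(str(list_file))
```

For a single image, `pytesseract.image_to_string('test.png')` takes the same shortcut.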

About the feature itself - we need to first test the tesseract stdin on most recent versions in order to be sure that it will work correctly with pytesseract.

bozhodimitrov avatar Oct 30 '19 08:10 bozhodimitrov

I just fiddled around with it and this does the trick, for tesseract 4.1.1 and 5.0.0-alpha-20201231-243-gff83 at least (I'm on windows 10 with both tesseract versions in MSYS2). It does not really honor the interface yet, because it does not give you any output files, even if an extension is given, but I was just trying to get tesseract with stdin to work.

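# Meant to replace run_and_get_output inside pytesseract.py, so it relies on
# that module's names (sys, shlex, subprocess, ENOENT, tesseract_cmd,
# DEFAULT_ENCODING, get_errors, TesseractError, TesseractNotFoundError).
import io  # extra import needed for the in-memory PNG buffer


def run_and_get_output(
    image,
    extension='',
    lang=None,
    config='',
    nice=0,
    timeout=0,
    return_bytes=False,
):
    cmd_args = []

    # 'nice' has to come before the tesseract command, not after it
    if not sys.platform.startswith('win32') and nice != 0:
        cmd_args += ('nice', '-n', str(nice))

    # 'stdin' and 'stdout' take the place of the input file and output base
    cmd_args += (tesseract_cmd, 'stdin', 'stdout')

    if lang is not None:
        cmd_args += ('-l', lang)

    if config:
        cmd_args += shlex.split(config)

    try:
        proc = subprocess.Popen(cmd_args, stdout=subprocess.PIPE, stdin=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError as e:
        if e.errno != ENOENT:
            raise
        raise TesseractNotFoundError()

    # encode the PIL image in memory and let communicate() feed it to stdin
    buffer = io.BytesIO()
    image.save(buffer, 'PNG')
    try:
        stdout_data, stderr_data = proc.communicate(buffer.getvalue(), timeout=timeout or None)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.communicate()
        raise RuntimeError('Tesseract process timeout')

    # check the exit code before returning, so errors are actually reported
    if proc.returncode:
        raise TesseractError(proc.returncode, get_errors(stderr_data))

    return stdout_data if return_bytes else stdout_data.decode(DEFAULT_ENCODING)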

j-hap avatar Mar 11 '21 07:03 j-hap

Nice, you can always monkey patch the module level function and make it work for you. One problem here is that this functionality might be limited to the newer versions of tesseract and we should consider the older 3.x versions too.
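For anyone who wants to try the monkey patch, here is a rough sketch. It assumes the function lives in the `pytesseract.pytesseract` submodule (the file linked at the top of this issue) and that `image_to_string` and friends look it up there at call time. `patch_run_and_get_output` is a hypothetical helper name, not something pytesseract ships.

```python
import importlib


def patch_run_and_get_output(replacement, module_name="pytesseract.pytesseract"):
    """Swap the module-level run_and_get_output and return the original."""
    module = importlib.import_module(module_name)
    original = module.run_and_get_output
    # Callers inside the module resolve the name at call time,
    # so they pick up the replacement from here on.
    module.run_and_get_output = replacement
    return original
```

Keep the returned original around so you can restore it, e.g. `pytesseract.pytesseract.run_and_get_output = original`.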

bozhodimitrov avatar Mar 11 '21 10:03 bozhodimitrov

I just compiled version 3.05.02 from https://github.com/tesseract-ocr/tesseract/tree/3.05.02 and ran the same modification as above with no problems with traineddata from https://github.com/tesseract-ocr/tessdata/tree/3.04.00. I know it's just a sample and not a full fledged test, but maybe that helps.

j-hap avatar Mar 12 '21 07:03 j-hap

Yes, this is very helpful. This means that this feature can be added to pytesseract. But we will need to add additional tests in order to be sure that the other functionality doesn't break. By other functionality, I mean the option to pass image paths as raw strings to pytesseract functions. Also, I am not sure if tesseract will honor the configuration options in combination with the stdin input.

And finally, based on the above - we should decide if this will be the default implementation (and if we should keep the old one around or not).

bozhodimitrov avatar Mar 12 '21 11:03 bozhodimitrov

I just wrote a CLI wrapper for my own use, which uses stdin and stdout to communicate with the tesseract executable. It also converts the TSV output into a more intuitive dict structure. The wrapper object can be fed an image in different flavors: image file path, image bytes, PIL image object, or ndarray. No temp file is created for input or output. https://github.com/mo-han/mo-han-toolbox/blob/master/mylib/wrapper/tesseract_ocr.py

from mylib.wrapper import tesseract_ocr
from PIL import Image

t=tesseract_ocr.TesseractOCRCLIWrapper(r'C:\Users\mo-han\AppData\Local\Programs\Tesseract-OCR\tesseract.exe')
t.set_language('chi_sim', 'eng').set_image_object(Image.open('r:1.jpg')).get_ocr_tsv_to_dict(psm=3, min_confidence=0.8)

[{'text': '名',
  'confidence': 0.91,
  'box': ((320, 10), (333, 10), (333, 23), (320, 23)),
  'page block paragraph line word level': (1, 1, 1, 1, 1, 5)},
 {'text': '称',
  'confidence': 0.87,
  'box': ((320, 28), (333, 28), (333, 40), (320, 40)),
  'page block paragraph line word level': (1, 1, 1, 1, 2, 5)},
 {'text': '修改',
  'confidence': 0.9,
  'box': ((320, 336), (333, 336), (333, 366), (320, 366)),
  'page block paragraph line word level': (1, 1, 1, 1, 3, 5)},
 {'text': '日',
  'confidence': 0.96,
  'box': ((320, 372), (333, 372), (333, 379), (320, 379)),
  'page block paragraph line word level': (1, 1, 1, 1, 4, 5)},
 {'text': '期',
  'confidence': 0.96,
  'box': ((320, 387), (333, 387), (333, 395), (320, 395)),
  'page block paragraph line word level': (1, 1, 1, 1, 5, 5)},
 {'text': ' ',
  'confidence': 0.95,
  'box': ((0, 195), (0, 195), (0, 224), (0, 224)),
  'page block paragraph line word level': (1, 2, 1, 1, 1, 5)},
 {'text': 'label_cn.txt',
  'confidence': 0.8,
  'box': ((283, 38), (295, 38), (295, 120), (283, 120)),
  'page block paragraph line word level': (1, 3, 1, 1, 1, 5)},
 {'text': '2019/8/5',
  'confidence': 0.87,
  'box': ((281, 337), (294, 337), (294, 401), (281, 401)),
  'page block paragraph line word level': (1, 3, 1, 1, 2, 5)},
 {'text': '15:24',
  'confidence': 0.95,
  'box': ((283, 408), (294, 408), (294, 446), (283, 446)),
  'page block paragraph line word level': (1, 3, 1, 1, 3, 5)},
 {'text': '2019/8/5',
  'confidence': 0.91,
  'box': ((254, 337), (267, 337), (267, 401), (254, 401)),
  'page block paragraph line word level': (1, 3, 1, 2, 3, 5)},
 {'text': '15:24',
  'confidence': 0.96,
  'box': ((256, 408), (267, 408), (267, 446), (256, 446)),
  'page block paragraph line word level': (1, 3, 1, 2, 4, 5)},
 {'text': '2019/8/5',
  'confidence': 0.89,
  'box': ((227, 337), (240, 337), (240, 401), (227, 401)),
  'page block paragraph line word level': (1, 3, 1, 3, 3, 5)},
 {'text': '15:24',
  'confidence': 0.96,
  'box': ((229, 408), (240, 408), (240, 446), (229, 446)),
  'page block paragraph line word level': (1, 3, 1, 3, 4, 5)}]

mo-han avatar Apr 01 '21 04:04 mo-han

Hi there. Having this implemented would be very useful: another dev and I are trying to read frames from a video, and a processing time of 400 ms per frame at 30 frames per second of video leads to very long run times. I have an OpenCV image in Python and would like to pass it directly into Tesseract instead of having it saved to disk.

GreenCobalt avatar Jun 04 '21 22:06 GreenCobalt

@GreenCobalt you could try my example code above

mo-han avatar Jun 04 '21 23:06 mo-han

@GreenCobalt you can try https://github.com/sirfz/tesserocr for your use case, but I am not sure what underlying version of the tesseract implementation is used.

bozhodimitrov avatar Jun 05 '21 15:06 bozhodimitrov

Wow! The snippet from @j-hap above really works, but not directly for OpenCV users: they have to convert their image into a PIL Image first.
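One way around the PIL conversion, as an untested sketch: encode the OpenCV frame to PNG in memory with `cv2.imencode` and pipe the bytes to tesseract. It assumes `tesseract` is on the PATH and that your tesseract version accepts `stdin` as input (see the version caveats earlier in this thread). `build_stdin_cmd` and `ocr_opencv_frame` are hypothetical names.

```python
import shlex
import subprocess


def build_stdin_cmd(tesseract_cmd="tesseract", lang=None, config=""):
    """Build a tesseract command that reads from stdin and writes to stdout."""
    cmd = [tesseract_cmd, "stdin", "stdout"]
    if lang is not None:
        cmd += ["-l", lang]
    if config:
        cmd += shlex.split(config)
    return cmd


def ocr_opencv_frame(frame, **kwargs):
    """OCR an OpenCV frame (ndarray) without writing it to disk."""
    import cv2  # optional dependency, only needed for the encoding step

    ok, png = cv2.imencode(".png", frame)
    if not ok:
        raise ValueError("could not encode frame as PNG")
    result = subprocess.run(
        build_stdin_cmd(**kwargs),
        input=png.tobytes(),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.decode("utf-8")
```

Because `cv2.imencode` expects OpenCV's usual BGR layout, no colour conversion is needed before encoding.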

dilerbatu avatar Aug 29 '22 08:08 dilerbatu