pytesseract
Research the option of using stdin/stdout instead of saving the image on disk
Hi, I am wondering why you don't use the stdin argument to send the image to tesseract instead of saving it to disk? https://github.com/madmaze/pytesseract/blob/25a9d38649f6d9f907f9c6750cab03d699d0b340/src/pytesseract.py#L208
Hi @cgallay, short answer: that was the initial implementation. I agree that it's not the optimal solution, and maybe it should be used only for debugging purposes.
The same question applies to stdout.
I found some issues with tesseract's stdin/stdout handling; some modes and versions are affected. For reference: tesseract-ocr/tesseract#785, tesseract-ocr/tesseract#85, among others.
It seems that the problems with the stdin are fixed in tesseract 5.0, but testing is needed.
Hi, I wanted to ask whether this feature has been added in the latest release of pytesseract. Saving the image to disk for the purpose of running the tesseract command is quite slow as of now (I think that is what pytesseract is doing right now). Can I work on this feature if it's not implemented? I will need help from you to implement it.
At the moment pytesseract supports passing the path of the image itself, which skips creating temp files on disk. You can also pass the path to a text file containing a list of images for batch processing; this also skips the pytesseract temp files. Both of those options are documented.
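A rough illustration of the two path-based options described above. The file names and both helper functions are made up for this sketch; the OCR call needs tesseract and pytesseract installed, so the import is done lazily inside the function.

```python
from pathlib import Path


def write_batch_list(image_paths, list_path):
    """Write one image path per line, for use as a batch input list."""
    list_file = Path(list_path)
    lines = "\n".join(str(p) for p in image_paths) + "\n"
    list_file.write_text(lines, encoding="utf-8")
    return list_file


def ocr_from_path(path, lang=None):
    """OCR an image file, or a .txt list of image files, given by path."""
    import pytesseract  # lazy import: only needed when OCR is actually run

    # A str argument is handed to tesseract as-is, so pytesseract does not
    # write its own temporary copy of the image.
    return pytesseract.image_to_string(str(path), lang=lang)
```

Typical use would be `ocr_from_path("page.png")` for a single image, or `ocr_from_path(write_batch_list(["a.png", "b.png"], "images.txt"))` for a batch.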
About the feature itself - we need to first test the tesseract stdin on most recent versions in order to be sure that it will work correctly with pytesseract.
I just fiddled around with it and this does the trick, at least for tesseract 4.1.1 and 5.0.0-alpha-20201231-243-gff83 (I'm on Windows 10 with both tesseract versions in MSYS2). It does not really honor the interface yet, because it does not give you any output files, even if an extension is given, but I was just trying to get tesseract with stdin to work.
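# Meant as a replacement inside pytesseract/pytesseract.py: tesseract_cmd,
# DEFAULT_ENCODING, get_errors, TesseractError and TesseractNotFoundError
# are names defined in that module.
import io
import shlex
import subprocess
import sys
from errno import ENOENT


def run_and_get_output(
    image,
    extension='',
    lang=None,
    config='',
    nice=0,
    timeout=0,
    return_bytes=False,
):
    cmd_args = []
    # 'nice' has to wrap the tesseract command, so it goes first.
    if not sys.platform.startswith('win32') and nice != 0:
        cmd_args += ('nice', '-n', str(nice))
    # 'stdin' and 'stdout' take the place of the input file and output base.
    cmd_args += (tesseract_cmd, 'stdin', 'stdout')
    if lang is not None:
        cmd_args += ('-l', lang)
    if config:
        cmd_args += shlex.split(config)
    try:
        proc = subprocess.Popen(cmd_args, stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError as e:
        if e.errno != ENOENT:
            raise
        raise TesseractNotFoundError()
    # Encode the PIL image in memory and let communicate() feed it to stdin,
    # which avoids blocking on a full pipe.
    buffer = io.BytesIO()
    image.save(buffer, 'PNG')
    try:
        stdout_data, stderr_data = proc.communicate(
            buffer.getvalue(), timeout=timeout or None)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.communicate()
        raise RuntimeError('Tesseract process timeout')
    # Check the exit code before returning; the original returned too early.
    if proc.returncode:
        raise TesseractError(proc.returncode, get_errors(stderr_data))
    return stdout_data if return_bytes else stdout_data.decode(DEFAULT_ENCODING)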
Nice, you can always monkey patch the module level function and make it work for you. One problem here is that this functionality might be limited to the newer versions of tesseract and we should consider the older 3.x versions too.
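A minimal sketch of the monkey-patching idea, assuming the replacement keeps the same signature as `run_and_get_output`. The `patched` helper and `ocr_with_custom_runner` are hypothetical names introduced here; the patch targets the `pytesseract.pytesseract` submodule because that is where the public functions look up `run_and_get_output`.

```python
import contextlib


@contextlib.contextmanager
def patched(module, name, replacement):
    """Temporarily replace module.<name>, restoring the original on exit."""
    original = getattr(module, name)
    setattr(module, name, replacement)
    try:
        yield original
    finally:
        setattr(module, name, original)


def ocr_with_custom_runner(image, runner):
    """Run image_to_string with a replacement for run_and_get_output."""
    import pytesseract  # lazy import: only needed when OCR is actually run
    from pytesseract import pytesseract as pt_module

    with patched(pt_module, "run_and_get_output", runner):
        return pytesseract.image_to_string(image)
```

Restoring the original in a `finally` block keeps the patch from leaking into other callers if the OCR call raises.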
I just compiled version 3.05.02 from https://github.com/tesseract-ocr/tesseract/tree/3.05.02 and ran the same modification as above with no problems, using traineddata from https://github.com/tesseract-ocr/tessdata/tree/3.04.00. I know it's just a sample and not a full-fledged test, but maybe that helps.
Yes, this is very helpful. This means that this feature can be added to pytesseract.
But we will need to add additional tests in order to be sure that the other functionality doesn't break.
By other functionality, I mean the option to pass image paths as raw strings to pytesseract functions.
Also, I am not sure if tesseract will honor the configuration options in combination with the stdin input.
And finally, based on the above - we should decide if this will be the default implementation (and if we should keep the old one around or not).
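If both implementations were kept around, one way to choose between them would be a version check on the output of `tesseract --version`. This is only a sketch: the function names are invented, and the default `min_version` is a hypothetical cut-off based on the single 3.05.02 sample reported above, not a tested compatibility boundary.

```python
import re


def parse_tesseract_version(version_output):
    """Extract (major, minor, patch) from the first line of the version text."""
    lines = version_output.splitlines()
    first_line = lines[0] if lines else ""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", first_line)
    if match is None:
        raise ValueError(f"no version number in {first_line!r}")
    major, minor, patch = match.groups(default="0")
    return int(major), int(minor), int(patch)


def should_use_stdin(version_output, min_version=(3, 5, 2)):
    """Return True for the stdin path, False for the temp-file path.

    min_version is an assumption for illustration only.
    """
    return parse_tesseract_version(version_output) >= min_version
```

The temp-file implementation would then remain the fallback for any version below the chosen cut-off.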
I just wrote a CLI wrapper for my own use, which uses stdin and stdout to communicate with the tesseract executable. It also converts the TSV output into a more intuitive dict structure. The wrapper object can be fed an image in different flavors: image_file_path, image_bytes, PIL image object, ndarray. No temp file is created for input or output.
https://github.com/mo-han/mo-han-toolbox/blob/master/mylib/wrapper/tesseract_ocr.py
from mylib.wrapper import tesseract_ocr
from PIL import Image
t=tesseract_ocr.TesseractOCRCLIWrapper(r'C:\Users\mo-han\AppData\Local\Programs\Tesseract-OCR\tesseract.exe')
t.set_language('chi_sim', 'eng').set_image_object(Image.open('r:1.jpg')).get_ocr_tsv_to_dict(psm=3, min_confidence=0.8)
[{'text': '名',
'confidence': 0.91,
'box': ((320, 10), (333, 10), (333, 23), (320, 23)),
'page block paragraph line word level': (1, 1, 1, 1, 1, 5)},
{'text': '称',
'confidence': 0.87,
'box': ((320, 28), (333, 28), (333, 40), (320, 40)),
'page block paragraph line word level': (1, 1, 1, 1, 2, 5)},
{'text': '修改',
'confidence': 0.9,
'box': ((320, 336), (333, 336), (333, 366), (320, 366)),
'page block paragraph line word level': (1, 1, 1, 1, 3, 5)},
{'text': '日',
'confidence': 0.96,
'box': ((320, 372), (333, 372), (333, 379), (320, 379)),
'page block paragraph line word level': (1, 1, 1, 1, 4, 5)},
{'text': '期',
'confidence': 0.96,
'box': ((320, 387), (333, 387), (333, 395), (320, 395)),
'page block paragraph line word level': (1, 1, 1, 1, 5, 5)},
{'text': ' ',
'confidence': 0.95,
'box': ((0, 195), (0, 195), (0, 224), (0, 224)),
'page block paragraph line word level': (1, 2, 1, 1, 1, 5)},
{'text': 'label_cn.txt',
'confidence': 0.8,
'box': ((283, 38), (295, 38), (295, 120), (283, 120)),
'page block paragraph line word level': (1, 3, 1, 1, 1, 5)},
{'text': '2019/8/5',
'confidence': 0.87,
'box': ((281, 337), (294, 337), (294, 401), (281, 401)),
'page block paragraph line word level': (1, 3, 1, 1, 2, 5)},
{'text': '15:24',
'confidence': 0.95,
'box': ((283, 408), (294, 408), (294, 446), (283, 446)),
'page block paragraph line word level': (1, 3, 1, 1, 3, 5)},
{'text': '2019/8/5',
'confidence': 0.91,
'box': ((254, 337), (267, 337), (267, 401), (254, 401)),
'page block paragraph line word level': (1, 3, 1, 2, 3, 5)},
{'text': '15:24',
'confidence': 0.96,
'box': ((256, 408), (267, 408), (267, 446), (256, 446)),
'page block paragraph line word level': (1, 3, 1, 2, 4, 5)},
{'text': '2019/8/5',
'confidence': 0.89,
'box': ((227, 337), (240, 337), (240, 401), (227, 401)),
'page block paragraph line word level': (1, 3, 1, 3, 3, 5)},
{'text': '15:24',
'confidence': 0.96,
'box': ((229, 408), (240, 408), (240, 446), (229, 446)),
'page block paragraph line word level': (1, 3, 1, 3, 4, 5)}]
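For readers who want a similar structure without the wrapper, here is a small sketch of turning tesseract's TSV output into a list of word dicts. It is not the linked wrapper's implementation: `tsv_to_words` is a hypothetical helper, the column names are those of tesseract's TSV header, and the corner order of `box` (top-left, top-right, bottom-right, bottom-left) is an assumption of this example.

```python
import csv
import io


def tsv_to_words(tsv_text, min_confidence=0.0):
    """Parse tesseract TSV text into word dicts, dropping low-confidence rows."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        # Level 5 rows are words; page/block/paragraph/line rows carry conf -1.
        if int(row["level"]) != 5:
            continue
        confidence = float(row["conf"]) / 100.0
        if confidence < min_confidence:
            continue
        left, top = int(row["left"]), int(row["top"])
        right, bottom = left + int(row["width"]), top + int(row["height"])
        words.append({
            "text": row.get("text") or "",
            "confidence": confidence,
            "box": ((left, top), (right, top), (right, bottom), (left, bottom)),
            "position": tuple(
                int(row[key])
                for key in ("page_num", "block_num", "par_num", "line_num", "word_num")
            ),
        })
    return words
```

`quoting=csv.QUOTE_NONE` keeps recognized text that contains quote characters from being misread as quoted CSV fields.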
Hi there, having this implemented would be very useful. Another dev and I are trying to read frames from a video, and a 400 ms processing time per frame at 30 fps leads to very long processing times. I have an OpenCV image in Python and would like to pass it directly to Tesseract instead of having it saved to disk.
@GreenCobalt you could try my example code above
@GreenCobalt you can try https://github.com/sirfz/tesserocr for your use case, but I am not sure what underlying version of the tesseract implementation is used.
Wow! The stdin example above really works, but not directly for OpenCV users: they have to convert their image into a PIL Image first.
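A sketch of that conversion, assuming the OpenCV frame is a `uint8` NumPy array in OpenCV's usual BGR channel order (or a 2-D grayscale array). `bgr_array_to_pil` is a hypothetical helper; it reverses the channel axis with NumPy, which has the same effect as `cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)` for 3-channel images.

```python
import numpy as np
from PIL import Image


def bgr_array_to_pil(frame):
    """Convert an OpenCV-style BGR or grayscale uint8 array to a PIL image."""
    if frame.ndim == 2:
        # Grayscale frames have no channel order to fix.
        return Image.fromarray(frame)
    if frame.ndim != 3 or frame.shape[2] != 3:
        raise ValueError("expected a 2-D grayscale or 3-channel BGR array")
    # Reverse the last axis (BGR -> RGB) and make the result contiguous.
    rgb = np.ascontiguousarray(frame[:, :, ::-1])
    return Image.fromarray(rgb)
```

The resulting image can then be passed to a function that calls `image.save(...)`, such as the stdin example above.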