pyocr
pyocr with latest Tesseract fails with pyocr.error.TesseractError: (1, b"Error, unknown command line argument '-psm'\n")
Good day,
I'm using pyocr through Paperless on an Ubuntu setup. I'm using the tesseract-ocr PPA [0], and with the latest version [1] pyocr throws an error.
[0]
cat /etc/apt/sources.list.d/alex-p-ubuntu-tesseract-ocr-artful.list
deb http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu artful main
[1]
tesseract --version
tesseract 4.0.0-beta.1-302-g3aa9
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.8 : zlib 1.2.11 : libwebp 0.6.0 : libopenjp2 2.3.0
Traceback:
littlebig@littlebig:~/Dev/paperless$ python3 /home/littlebig/Dev/paperless/src/manage.py document_consumer
Starting document consumer at /home/littlebig/paperless_consumption_dir with inotify
Parsers available: RasterisedDocumentParser
Consuming /home/littlebig/paperless_consumption_dir/BRW90CDB68D60F5_000798.pdf
Processing sheet #1: /tmp/paperless/paperless-b5bgnwtm/convert-0000.pnm -> /tmp/paperless/paperless-b5bgnwtm/convert-0000.unpaper.pnm
[pgm_pipe @ 0x55cbcbdfb980] Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x55cbcbe00140] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x55cbcbe00140] Encoder did not produce proper pts, making some up.
OCRing the document
Parsing for eng
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 290, in image_to_string
    return ocr.image_to_string(f, lang=lang)
  File "/home/littlebig/.local/lib/python3.6/site-packages/pyocr/tesseract.py", line 367, in image_to_string
    raise TesseractError(status, errors)
pyocr.error.TesseractError: (1, b"Error, unknown command line argument '-psm'\n")
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/littlebig/Dev/paperless/src/manage.py", line 18, in <module>
    execute_from_command_line(sys.argv)
  File "/home/littlebig/.local/lib/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
    utility.execute()
  File "/home/littlebig/.local/lib/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/littlebig/.local/lib/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/littlebig/.local/lib/python3.6/site-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/home/littlebig/Dev/paperless/src/documents/management/commands/document_consumer.py", line 98, in handle
    self.loop_inotify(mail_delta)
  File "/home/littlebig/Dev/paperless/src/documents/management/commands/document_consumer.py", line 131, in loop_inotify
    self.loop_step(mail_delta)
  File "/home/littlebig/Dev/paperless/src/documents/management/commands/document_consumer.py", line 123, in loop_step
    self.file_consumer.consume_new_files()
  File "/home/littlebig/Dev/paperless/src/documents/consumer.py", line 107, in consume_new_files
    if not self.try_consume_file(file):
  File "/home/littlebig/Dev/paperless/src/documents/consumer.py", line 145, in try_consume_file
    date = parsed_document.get_date()
  File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 209, in get_date
    text = self.get_text()
  File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 80, in get_text
    self._text = self._get_ocr(images)
  File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 140, in _get_ocr
    raw_text = self._ocr([imgs[middle]], self.DEFAULT_OCR_LANGUAGE)
  File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 189, in _ocr
    r = pool.map(image_to_string, itertools.product(imgs, [lang]))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
pyocr.error.TesseractError: (1, b"Error, unknown command line argument '-psm'\n")
littlebig@littlebig:~/Dev/paperless$
Has anyone else come across this? Thanks!
Having a look through the pyocr sources this stands out to me:
src/pyocr/builders.py
307- file_ext = ["txt"]
308: tess_flags = ["-psm", str(tesseract_layout)]
309- cun_args = ["-f", "text"]
--
564- file_ext = ["html", "hocr"]
565: tess_flags = ["-psm", str(tesseract_layout)]
566- tess_conf = ["hocr"]
--
640- file_ext = ["html", "hocr"]
641: tess_flags = ["-psm", str(tesseract_layout)]
642- tess_conf = ["hocr"]
Does pyocr just use -psm instead of --psm as the parameter? I'm wondering whether that is not accepted anymore now.
Does pyocr just use -psm instead of --psm as the parameter? I'm wondering whether that is not accepted anymore now.
It looks like this is the problem. I changed the options passed in builders.py to provide --psm instead of -psm, and it works fine now. I might open a pull request for this, though I'm not sure whether the change has other implications.
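For anyone who needs to support both old and new Tesseract builds, here is a rough sketch of choosing the flag from the reported version instead of hard-coding it. The helper names (parse_tesseract_version, psm_flag, build_tess_flags) and the default layout value are made up for illustration and are not pyocr's API. The cutoff at 4.0 is an assumption: this thread only confirms that 4.0.0-beta.1 rejects -psm.

```python
import re


def parse_tesseract_version(version_output):
    """Extract (major, minor, patch) from the output of `tesseract --version`.

    Hypothetical helper, not part of pyocr.
    """
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?", version_output)
    if match is None:
        raise ValueError("unrecognised version output: %r" % version_output)
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def psm_flag(version):
    """Return the page-segmentation-mode option name for a version tuple.

    Assumption: everything from 4.0 on wants the double-dash form, and
    older 3.x builds still want the single-dash form.
    """
    return "--psm" if version >= (4, 0, 0) else "-psm"


def build_tess_flags(version_output, tesseract_layout=3):
    """Build the flag list that builders.py currently hard-codes with '-psm'."""
    version = parse_tesseract_version(version_output)
    return [psm_flag(version), str(tesseract_layout)]


if __name__ == "__main__":
    # Version line taken from the report above.
    print(build_tess_flags("tesseract 4.0.0-beta.1-302-g3aa9"))
```

In a real patch the version string would come from wherever pyocr already queries Tesseract, rather than being passed in by hand.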
The commit in question in tesseract is the following: https://github.com/tesseract-ocr/tesseract/commit/ee201e1f4fa277a4b2ecd751a45d3bf1eba6dfdb
I also came across this today. I note that -psm is used not just in builders.py but also in tesseract.py.
https://github.com/openpaperwork/pyocr/pull/100
I haven't had a chance yet to work out the circular import statements that I introduced in https://github.com/ddddavidmartin/pyocr/tree/update_deprecated_psm_option_string. If anyone wants to step in, feel free to give it a go.
For now, a quick and dirty fix is to just apply https://github.com/openpaperwork/pyocr/pull/100/commits/c136838b46cf49f06ac1dc5f2f9bc16232c11213.