OCR engine executable path should be configurable

Open xelxebar opened this issue 5 years ago • 3 comments

Overview

On Void Linux, the tesseract binary resides at /usr/bin/tesseract-ocr due to a naming conflict with the game Tesseract. It would be nice if the paths to the OCR engine could be explicitly specified, e.g. via a command line option, environment variable, or configuration file.
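One way the requested lookup order could work is sketched below in Python 3. This is only an illustration, not ocrodjvu code: the function name `resolve_executable`, the environment variable `OCRODJVU_TESSERACT`, and the candidate list are all assumptions made up for this example.

```python
import os
import shutil

# Hypothetical environment variable name; ocrodjvu does not define it.
ENV_VAR = "OCRODJVU_TESSERACT"


def resolve_executable(cli_value=None, environ=None,
                       candidates=("tesseract", "tesseract-ocr"),
                       which=shutil.which):
    """Return the OCR engine executable, preferring explicit settings."""
    environ = os.environ if environ is None else environ
    # 1. An explicit command-line value wins.
    if cli_value:
        return cli_value
    # 2. Next, the (hypothetical) environment variable.
    env_value = environ.get(ENV_VAR)
    if env_value:
        return env_value
    # 3. Otherwise, take the first candidate name found on PATH.
    for name in candidates:
        path = which(name)
        if path is not None:
            return path
    raise FileNotFoundError(
        "no OCR engine executable found; tried: " + ", ".join(candidates)
    )
```

Probing PATH alone would not be enough on a system where `tesseract` is the game binary, which is why the explicit settings take precedence in this sketch.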

Version Information

$ ocrodjvu --version
ocrodjvu 0.11
+ Python 2.7.16
+ subprocess32
+ python-djvulibre 0.8.4
+ lxml 4.3.3

$ lsb_release --all
LSB Version:	1.0
Distributor ID:	VoidLinux
Description:	Void Linux
Release:	rolling
Codename:	void

Comments

For the moment, I am hacking around this issue by packaging ocrodjvu for my distro with the following patch:

--- a/lib/engines/tesseract.py
+++ b/lib/engines/tesseract.py
@@ -111,7 +111,7 @@
     image_format = image_io.TIFF
     needs_utf8_fix = True
 
-    executable = utils.property('tesseract')
+    executable = utils.property('tesseract-ocr')
     extra_args = utils.property([], shlex.split)
     use_hocr = utils.property(None, int)
     fix_html = utils.property(0, int)

xelxebar, Apr 24 '19 12:04

It's not documented at the moment, but you can specify the executable via command line with:

-X executable=tesseract-ocr
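For readers curious how `KEY=VALUE` pairs of this kind might be turned into typed settings, here is a rough Python sketch. It is not taken from ocrodjvu: the function `parse_engine_options` and the `CONVERTERS` table are hypothetical, and the setting names other than `executable` are only assumed to be settable this way, based on the attributes visible in the patch above.

```python
import shlex

# Hypothetical table of settings and converters, modelled on the
# attributes shown in the patch (executable, extra_args, use_hocr, fix_html).
CONVERTERS = {
    "executable": str,
    "extra_args": shlex.split,
    "use_hocr": int,
    "fix_html": int,
}


def parse_engine_options(pairs):
    """Convert a list of 'KEY=VALUE' strings into a dict of typed settings."""
    options = {}
    for pair in pairs:
        # Split on the first '=' only, so values may contain '='.
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError("expected KEY=VALUE, got %r" % pair)
        if key not in CONVERTERS:
            raise ValueError("unknown engine option %r" % key)
        options[key] = CONVERTERS[key](value)
    return options
```

Under these assumptions, `parse_engine_options(["executable=tesseract-ocr"])` would yield a dict mapping `executable` to the string `tesseract-ocr`.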

jwilk, Apr 24 '19 13:04

Oh! Nice. Thanks for the quick feedback. Are there any gotchas? If it's a reasonably stable option, would be nice to put it in the docs.

xelxebar, Apr 24 '19 14:04

I considered using the Tesseract API (maybe through tesserocr) instead of the CLI, which would render the executable setting meaningless. But realistically, the switch to the API is unlikely to happen in the foreseeable future.

Yes, -X executable=… (and other -X goodies) should be documented.

jwilk, May 01 '19 18:05