ocrodjvu
OCR engine executable path should be configurable
Overview
On Void Linux, the tesseract binary resides at /usr/bin/tesseract-ocr due to a naming conflict with the game Tesseract. It would be nice if the path to the OCR engine could be specified explicitly, e.g. via a command-line option, environment variable, or configuration file.
Version Information
$ ocrodjvu --version
ocrodjvu 0.11
+ Python 2.7.16
+ subprocess32
+ python-djvulibre 0.8.4
+ lxml 4.3.3
$ lsb_release --all
LSB Version: 1.0
Distributor ID: VoidLinux
Description: Void Linux
Release: rolling
Codename: void
Comments
For the moment, I am working around this issue by packaging ocrodjvu for my distro with the following patch:
--- a/lib/engines/tesseract.py
+++ b/lib/engines/tesseract.py
@@ -111,7 +111,7 @@
     image_format = image_io.TIFF
     needs_utf8_fix = True
-    executable = utils.property('tesseract')
+    executable = utils.property('tesseract-ocr')
     extra_args = utils.property([], shlex.split)
     use_hocr = utils.property(None, int)
     fix_html = utils.property(0, int)
It's not documented at the moment, but you can specify the executable via command line with:
-X executable=tesseract-ocr
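For packagers who would rather not hard-code the name, here is a minimal sketch of a wrapper-side helper that picks whichever Tesseract binary name is present and builds the matching -X argument. It is not part of ocrodjvu: the function names, the candidate list, and the idea of a wrapper script are all assumptions made for illustration. It uses shutil.which, so it needs Python 3.3 or later. The distro-specific name is tried first because, where the game is installed, a plain "tesseract" on PATH may be the game rather than the OCR engine.

```python
import shutil


def find_ocr_executable(candidates=("tesseract-ocr", "tesseract"),
                        which=shutil.which):
    """Return the first candidate name found on PATH, or None.

    Hypothetical helper: the candidate names and their order are
    assumptions, not something ocrodjvu itself defines.
    """
    for name in candidates:
        if which(name) is not None:
            return name
    return None


def engine_option_args(executable):
    """Build the -X argument pair mentioned in this thread."""
    return ["-X", "executable=" + executable]


if __name__ == "__main__":
    exe = find_ocr_executable()
    if exe is None:
        print("no Tesseract OCR binary found on PATH")
    else:
        # These arguments would be appended to an ocrodjvu command line.
        print(" ".join(engine_option_args(exe)))
```

The which parameter exists only so the lookup can be replaced in tests; in normal use the default is enough.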
Oh! Nice. Thanks for the quick feedback. Are there any gotchas? If it's a reasonably stable option, would be nice to put it in the docs.
I considered using the Tesseract API (maybe through tesserocr) instead of the CLI, which would render the executable setting meaningless.
But realistically, the switch to API is unlikely to happen in the foreseeable future.
Yes, -X executable=… (and other -X goodies) should be documented.