Add HOCR output as a sidecar option

Open parkerhancock opened this issue 7 years ago • 3 comments

I have many applications where the physical location of text on a page is significant, and an existing codebase built around the HOCR html format.

What would make this library completely killer is an option to produce a sidecar file of the hocr data from Tesseract. I know that Tesseract natively can produce HOCR data, so the change shouldn't be difficult. The only question is how to integrate that into the existing command line interface.

Maybe a new option for --sidecar-hocr?

Flipping through the codebase now to see if there's an easy option.

parkerhancock avatar Jul 27 '17 20:07 parkerhancock

ocrmypdf has three PDF renderers.

One of them is called the hocr renderer and uses HOCR as an intermediate format. For your use case it might make the most sense to use the older hocr renderer, since you intend to use the hocr for other things.

So

ocrmypdf -k --pdf-renderer hocr

which will output a temporary folder with all working files, including the hocr files per page. The main drawback of the hocr renderer is that its support for non-Latin script is poor.

If you'd prefer to force generation of hocr files using the new (and default) sandwich renderer (best PDF quality, requires Tesseract 3.05.01 or newer):

ocrmypdf -k --tesseract-config hocr <rest of your arguments>

I will think about adding an option for hocr sidecars that involves less hackery, but this should do it for now.

jbarlow83 avatar Jul 27 '17 22:07 jbarlow83
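
For anyone who wants the text positions out of the per-page hocr files once they have them, here is a minimal sketch using only the Python standard library. It assumes the files follow the usual hOCR convention of marking each word as a span with class ocrx_word and a title attribute containing "bbox x0 y0 x1 y1". The names HocrWordParser and extract_words are made up for this example and are not part of ocrmypdf or Tesseract.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrWordParser(HTMLParser):
    """Collects (text, (x0, y0, x1, y1)) for every ocrx_word span.

    Hypothetical helper for illustration; not an ocrmypdf API.
    """

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read, if any
        self._text = []     # text fragments of the current word
        self._depth = 0     # nested <span> depth inside the current word

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            # Already inside a word: only track nested spans.
            if tag == "span":
                self._depth += 1
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []
                self._depth = 0

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is None or tag != "span":
            return
        if self._depth > 0:
            self._depth -= 1
            return
        text = "".join(self._text).strip()
        if text:
            self.words.append((text, self._bbox))
        self._bbox = None


def extract_words(hocr_text):
    """Return a list of (word, bbox) tuples from one hOCR document."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

Reading one page would then look like extract_words(Path("page.hocr").read_text(encoding="utf-8")), where page.hocr stands in for whichever file you found in the working folder. Coordinates are in pixels of the image Tesseract saw, so they need scaling if you want PDF points.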

Any further thoughts on adding additional sidecar features?

andrewjfreyer avatar Aug 20 '19 18:08 andrewjfreyer

Your suggestion to use ocrmypdf -k --tesseract-config hocr <rest of your arguments> works great together with keep-temporary-files=true. The only issue I currently have is finding the temp files. The path is printed while the OCR is running, but it changes from run to run. Is there any way to query the ocrmypdf object to get the temp path so I know where to look for the .hocr file?

zweissman avatar Apr 21 '20 15:04 zweissman
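
Not an answer to the API question, but as a workaround, here is a rough sketch that collects every .hocr file under a folder you already know. It assumes you pass in either the working folder path that ocrmypdf printed for that run or a parent directory you control. The function name find_hocr_files is made up for this example. One possible way to get a predictable parent is to point the TMPDIR environment variable at your own directory before running, since Python's tempfile module honors it, but whether ocrmypdf creates its working folder through tempfile is an assumption worth checking against your version.

```python
from pathlib import Path


def find_hocr_files(work_dir):
    """Return all *.hocr files below work_dir, sorted by path.

    Hypothetical helper for illustration; work_dir is whatever folder
    you know the kept temporary files ended up in.
    """
    root = Path(work_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"not a directory: {root}")
    # rglob searches recursively, so passing a parent folder also works.
    return sorted(p for p in root.rglob("*.hocr") if p.is_file())
```

Searching recursively means the run-specific folder name does not need to be known in advance, as long as the parent only contains output from the run you care about.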