Add hOCR output format
This change adds rudimentary hOCR output support. Notes:
- Currently it only adds bounding boxes to the hOCR output, not baselines (which hOCR also supports).
- It doesn't add any semantic layout information; instead, it just represents each word as an ocrx_word.
- Some of the metadata could be improved, for example by adding the real image name and perhaps the EasyOCR version number.
- I didn't check whether EasyOCR supports multipage inputs; if it does, this will certainly break with them.
- I left this comment in the source code; I'm not sure what to do with it (it probably shouldn't be enabled by default):
# In order to get a browser-renderable HTML file, you can add this before the closing </body> tag:
#
# <script src="https://unpkg.com/hocrjs"></script>
Other than that, I validated the output with hocr-check from https://github.com/ocropus/hocr-tools and also checked that it validates as XHTML.
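
For reviewers who want a feel for the mapping, here is a minimal sketch of the idea. It is not the code in this change. It assumes results shaped like the `(bbox, text, confidence)` tuples returned by EasyOCR's `readtext`, with each bbox given as four corner points and confidence in the range 0 to 1. The function name `results_to_hocr`, the `include_viewer` flag, the `word_1_N` id scheme, and the `ocr-system` value are all hypothetical and chosen only for illustration.

```python
from html import escape


def results_to_hocr(results, page_width, page_height, include_viewer=False):
    """Render (bbox, text, confidence) tuples as a single-page hOCR string.

    Hypothetical helper: each bbox is assumed to be four (x, y) corner
    points, and confidence is assumed to lie in [0, 1].
    """
    words = []
    for index, (bbox, text, confidence) in enumerate(results, start=1):
        # Collapse the four corner points into an axis-aligned box.
        xs = [int(round(x)) for x, _ in bbox]
        ys = [int(round(y)) for _, y in bbox]
        title = "bbox {} {} {} {}; x_wconf {}".format(
            min(xs), min(ys), max(xs), max(ys), int(round(confidence * 100))
        )
        # One ocrx_word span per result; no lines, paragraphs, or baselines.
        words.append(
            '<span class="ocrx_word" id="word_1_{}" title="{}">{}</span>'.format(
                index, title, escape(text)
            )
        )

    # Optional viewer script, off by default (assumed flag, see note above).
    viewer = (
        '<script src="https://unpkg.com/hocrjs"></script>\n'
        if include_viewer
        else ""
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        "<head>\n<title></title>\n"
        '<meta name="ocr-system" content="easyocr" />\n'
        '<meta name="ocr-capabilities" content="ocr_page ocrx_word" />\n'
        "</head>\n<body>\n"
        '<div class="ocr_page" id="page_1" title="bbox 0 0 {} {}">\n'.format(
            page_width, page_height
        )
        + "\n".join(words)
        + "\n</div>\n"
        + viewer
        + "</body>\n</html>\n"
    )


if __name__ == "__main__":
    sample = [([[10, 20], [110, 20], [110, 50], [10, 50]], "Hello & bye", 0.5)]
    print(results_to_hocr(sample, page_width=200, page_height=100))
```

Keeping the viewer script behind a flag that defaults to off matches the concern above that it probably shouldn't be enabled by default.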