archive-hocr-tools
Efficient hOCR tooling
If a confidence value is not a whole number, parsing it throws an exception at line 186 of `parse.py`. Traceback (most recent call last): File "/Users/jaredwhiklo/www/DAM/scripts/archive-pdf-tools/bin/recode_pdf", line 302, in...
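To illustrate the failure mode, here is a minimal sketch of reading a word confidence from an hOCR `title` attribute so that both whole and fractional values are accepted. The helper name `parse_wconf` and the regular expression are assumptions for illustration; this is not the code at line 186 of `parse.py`.

```python
import re

# Hypothetical helper, not the library's implementation.
# Matches "x_wconf 87" as well as "x_wconf 95.5".
_WCONF_RE = re.compile(r"x_wconf\s+([0-9]+(?:\.[0-9]+)?)")


def parse_wconf(title):
    """Return the x_wconf value of an hOCR title attribute as a float, or None."""
    match = _WCONF_RE.search(title)
    if match is None:
        return None
    # float() accepts "87" and "95.5"; int() would raise on the latter.
    return float(match.group(1))
```

Callers that need an integer could round the result explicitly rather than calling `int()` on the raw string.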
Many of the tools currently cannot work on special files such as `/dev/stdin` in bash, or accept input from `stdin` in general, because of some unnecessary seeks. Additionally, it...
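One possible workaround, sketched below, is to buffer a non-seekable input in memory before handing it to code that calls `seek()`. The function name `ensure_seekable` is hypothetical and not part of the tools; removing the unnecessary seeks, as the issue suggests, would avoid the extra memory use.

```python
import io


def ensure_seekable(stream):
    """Return a seekable binary stream with the same content as `stream`."""
    try:
        if stream.seekable():
            return stream
    except AttributeError:
        # Object has no seekable() method; fall through and buffer it.
        pass
    # Pipes and /dev/stdin cannot seek, so read everything into memory once.
    return io.BytesIO(stream.read())
```

A tool could then wrap its input with `ensure_seekable(sys.stdin.buffer)`, at the cost of holding the whole file in memory.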
This commit handles cases where no `pageType` is detected by skipping the page.
This commit adds support for converting ISO 639 Part 2B language codes to two-character codes, e.g. `fre` for French rather than the Part 3 `fra`. IA items will often include `fre`, `ger`, etc.,...
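As a rough sketch of this kind of lookup, assuming the target is a two-letter code, the table below maps a handful of Part 2B and Part 3 codes to their two-letter equivalents. The table is a small illustrative subset and `to_two_letter` is a hypothetical name; the commit itself may rely on a complete ISO 639 data source.

```python
# Illustrative subset only; a real tool would use a full ISO 639 table.
TO_TWO_LETTER = {
    "fre": "fr", "fra": "fr",  # French: Part 2B, Part 3
    "ger": "de", "deu": "de",  # German: Part 2B, Part 3
    "dut": "nl", "nld": "nl",  # Dutch: Part 2B, Part 3
    "eng": "en",               # English: same code in both parts
}


def to_two_letter(code):
    """Map a two- or three-letter language code to two letters, or None if unknown."""
    code = code.strip().lower()
    if len(code) == 2:
        # Assume it is already a two-letter code and pass it through.
        return code
    return TO_TWO_LETTER.get(code)
```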
This PR adds two commits to address two separate `epubcheck` validation errors. The first relates to the mediatype (and HTML escaping), and the second relates to the table of contents....
This commit uses the item `identifier` as the book title if the item is lacking a `title` in its metadata. The DAISY spec requires a title: https://daisy.org/activities/standards/daisy/daisy-3/z39-86-2005-r2012-specifications-for-the-digital-talking-book/
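The fallback can be sketched as follows, assuming the item metadata is available as a plain dictionary. The function name `book_title` and the dictionary shape are assumptions for illustration, not the commit's actual code.

```python
def book_title(metadata, identifier):
    """Return the metadata title, or the item identifier when no usable title exists."""
    title = metadata.get("title")
    # Treat a missing, non-string, or blank title as absent.
    if isinstance(title, str) and title.strip():
        return title.strip()
    return identifier
```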