hocrviewer-mirador
View HOCR files with Mirador
HOCRViewer
Read books in HOCR format with Mirador.
Requirements
- Python 3.5
- Optional: An SQLite version that supports FTS5 (check with
sqlite3 ":memory:" "PRAGMA compile_options;" |grep FTS5)
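The same check can be made from Python, which is useful because the SQLite library linked into the Python interpreter is not necessarily the one behind the sqlite3 command-line tool. The following is a minimal sketch; the function name has_fts5 is made up for this example and is not part of HOCRViewer:

```python
import sqlite3


def has_fts5():
    """Return True if the SQLite library used by Python was built with FTS5."""
    conn = sqlite3.connect(":memory:")
    try:
        # Each row holds one compile-time option, e.g. ('ENABLE_FTS5',)
        options = [row[0] for row in conn.execute("PRAGMA compile_options")]
    finally:
        conn.close()
    # Mirrors the `grep FTS5` from the shell check above
    return any("FTS5" in option for option in options)


if __name__ == "__main__":
    print("FTS5 available:", has_fts5())
```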
Installation
$ pip install -r requirements.txt
Data format
The HOCR file must contain all pages as ocr_page elements. These must have
a title attribute that contains the following fields (as per the
HOCR Specification):
- ppageno: The physical page number
- image: The relative path (from the HOCR file) to the page image
- bbox: The dimensions of the image
Additionally, each ocr_page element must have an id attribute that
assigns a unique identifier to the page.
Example:
<div class="ocr_page" id="page_0005"
title="ppageno 4; image spyri_heidi_1880/00000005.tif; bbox 0 0 2013 2985"/>
Alternatively, HOCR files with accompanying images that are stored like the Google 1000 Books dataset (download instructions) can be indexed and viewed as well.
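To check whether a file meets the title attribute requirements described above, it can help to see how such a title splits into its fields. The following is only an illustrative sketch, not the parser HOCRViewer itself uses; the function name parse_page_title is hypothetical:

```python
def parse_page_title(title):
    """Split an ocr_page title attribute into a dict of its fields."""
    fields = {}
    # Properties are separated by semicolons; each is "name value..."
    for prop in title.split(";"):
        prop = prop.strip()
        if not prop:
            continue
        name, _, value = prop.partition(" ")
        fields[name] = value.strip()
    if "ppageno" in fields:
        fields["ppageno"] = int(fields["ppageno"])
    if "image" in fields:
        # Some hOCR producers wrap the image path in quotes
        fields["image"] = fields["image"].strip("\"'")
    if "bbox" in fields:
        # Four integers: x0 y0 x1 y1
        fields["bbox"] = tuple(int(n) for n in fields["bbox"].split())
    return fields


page = parse_page_title(
    "ppageno 4; image spyri_heidi_1880/00000005.tif; bbox 0 0 2013 2985")
missing = {"ppageno", "image", "bbox"} - set(page)
print(page, "missing fields:", sorted(missing))
```

For the example element, missing is empty, so the page carries all three required fields.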
Usage
Simply point the application to a directory containing hOCR files and it will serve a web interface where you can view them:
$ python hocrviewer.py serve /mnt/data/hocr
Alternatively, you can index your files before serving them. This has two main advantages: it significantly reduces response times for the manifests and annotations, and it enables search within the books (not yet usable from Mirador, but keep an eye on this PR).
To do so, run the index subcommand with the path to the directory with
your HOCR files as the first argument. By default, the database will be
written to ~/.config/hocrviewer/hocrviewer.db, but you can override this
with the --db-path option that is passed before the subcommand:
$ python hocrviewer.py --db-path /tmp/test.db index /mnt/data/hocr
After the index has been created, run the application with the serve
subcommand, making sure that you pass the same --db-path value as during
indexing:
$ python hocrviewer.py --db-path /tmp/test.db serve
The application exposes all books as IIIF manifests at
/iiif/<book_name>, where book_name is the file name of the HOCR file
for the book without the .html extension.
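The mapping from a HOCR file to its manifest URL can be sketched as follows. The helper manifest_url is hypothetical, and the base URL http://localhost:5000 is an assumption made for illustration; use whatever address the serve subcommand reports on startup:

```python
import os.path


def manifest_url(base_url, hocr_path):
    """Build the IIIF manifest URL for a HOCR file served by the application."""
    book_name = os.path.basename(hocr_path)
    # The book name is the file name without the .html extension
    if book_name.endswith(".html"):
        book_name = book_name[:-len(".html")]
    return base_url.rstrip("/") + "/iiif/" + book_name


# Assumed base URL; adjust to the host and port your server listens on
print(manifest_url("http://localhost:5000",
                   "/mnt/data/hocr/spyri_heidi_1880.html"))
```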
Planned Features
- Search across all books (backend done, user interface missing)
- Edit OCR with a custom AnnotationEditor implementation for Mirador
- Browse books in a paginated view outside of Mirador (which gets overwhelmed with large libraries)