OD2
Editing OCR after ingest
Descriptive summary
The ability to edit OCR after ingest will allow mistakes to be corrected, which improves searching and provides accurate transcripts. It could also be used for small transcription or crowdsourcing projects under the supervision of a curator.
Expected behavior
Authorized users (curators, depositors, admins) are able to access, edit, and save the OCR for any object. The corrected OCR text is made available for display and for download.
Accessibility Concerns
Accurate OCR is key for accessibility.
Corey noted that since the OCR is stored in hOCR format ( https://en.wikipedia.org/wiki/HOCR ), editing the text directly in a large textbox would be tricky because the text is interleaved with markup and layout coordinates. A visual editor would be much easier to use.
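To illustrate why raw hOCR is awkward to edit by hand, here is a minimal sketch of what a programmatic correction might look like. It assumes a small, hand-written hOCR fragment (not output from OD) and uses only the Python standard library. The sample text, the word ids, and the helper names parse_bbox, list_words, and correct_word are hypothetical and chosen for illustration only.

```python
import xml.etree.ElementTree as ET

# Hand-written hOCR fragment for illustration only (an assumption, not OD data).
# Each word is wrapped in markup that carries its bounding box on the page image.
SAMPLE_HOCR = """<div class="ocr_page" title="bbox 0 0 800 600">
  <span class="ocr_line" title="bbox 10 10 400 40">
    <span class="ocrx_word" id="word_1" title="bbox 10 10 90 40">Hcllo</span>
    <span class="ocrx_word" id="word_2" title="bbox 100 10 200 40">world</span>
  </span>
</div>"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return tuple(int(n) for n in parts[1:5])
    return None


def list_words(root):
    """Return (id, text, bbox) for every ocrx_word element."""
    return [
        (el.get("id"), el.text, parse_bbox(el.get("title", "")))
        for el in root.iter("span")
        if el.get("class") == "ocrx_word"
    ]


def correct_word(root, word_id, new_text):
    """Replace the text of one word, leaving its markup and bbox untouched."""
    for el in root.iter("span"):
        if el.get("class") == "ocrx_word" and el.get("id") == word_id:
            el.text = new_text
            return True
    return False


root = ET.fromstring(SAMPLE_HOCR)
correct_word(root, "word_1", "Hello")  # fix the OCR error "Hcllo"
print(list_words(root))
```

A visual editor does the same kind of thing behind the scenes: it uses the bounding boxes to place each word over the page image and changes only the word text on save, so the layout information survives the edit.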
Possible hOCR visual editors:
- https://github.com/not-implemented/hocr-proofreader Demo: https://www.not-implemented.de/hocr-proofreader/
- https://github.com/GeReV/hocr-editor
POSM has reviewed these editors and would like a list of user requirements from Metadeities to inform the selection of an editor.
Metadeities discussed this and would like the following:
- A side-by-side editor view showing the text in the context of the document next to the OCR'd text, as in both example editors above
- OCR editing available to Reviewer-level users and above
- OCR text viewable in the editor by Depositor-level users
- Changes to OCR logged on the work, like other changes
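The access and logging requirements above could be sketched roughly as follows. This is only an illustration under stated assumptions: the Role ordering (Depositor below Reviewer below Admin), the Work and OcrChange structures, and the edit_word function are all hypothetical and do not reflect how OD actually models users, works, or change history.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Role(IntEnum):
    # Hypothetical ordering; OD's real role hierarchy may differ.
    DEPOSITOR = 1
    REVIEWER = 2
    ADMIN = 3


def can_view_ocr(role):
    """Depositor-level users and above may view OCR in the editor."""
    return role >= Role.DEPOSITOR


def can_edit_ocr(role):
    """Reviewer-level users and above may edit OCR."""
    return role >= Role.REVIEWER


@dataclass
class OcrChange:
    user: str
    word_id: str
    old_text: str
    new_text: str
    timestamp: datetime


@dataclass
class Work:
    words: dict  # word id -> OCR text (stand-in for the stored hOCR)
    change_log: list = field(default_factory=list)


def edit_word(work, user, role, word_id, new_text):
    """Apply one OCR correction and record it on the work's change log."""
    if not can_edit_ocr(role):
        raise PermissionError(f"{user} is not allowed to edit OCR")
    old_text = work.words[word_id]
    work.words[word_id] = new_text
    work.change_log.append(
        OcrChange(user, word_id, old_text, new_text, datetime.now(timezone.utc))
    )


work = Work(words={"word_1": "Hcllo"})
edit_word(work, "reviewer@example.org", Role.REVIEWER, "word_1", "Hello")
print(work.words, len(work.change_log))
```

Recording the old and new text with each change keeps OCR edits auditable in the same way as other changes to a work, and would make it possible to review or revert crowdsourced corrections.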
Contingent asks:
- If editing the text hierarchy of a document and/or editing blank elements in a document is useful for supporting the accessibility features of OD, then we would like the editor to support those edits (see editor example 2 above)
- If OD ingests OCR that was completed outside of OD before ingest, that OCR needs to be editable as well
Question for Features:
- Does OD ingest OCR completed prior to ingest? Accessibility features added to PDFs before ingest can be robust, and we would like to preserve those features at ingest if we don't already.