
Editing OCR after ingest

Open jsimic opened this issue 4 years ago • 3 comments

Descriptive summary

The ability to edit OCR post-ingest would allow mistakes to be corrected, improve searching, and provide accurate transcripts. It could also be used for small transcription or crowdsourcing projects under the supervision of a curator.

Expected behavior

Authorized users (curators, depositors, admins) are able to access, edit, and save the OCR for any object. The corrected OCR text is made available for display and download.

Accessibility Concerns

Accurate OCR is key for accessibility.

jsimic avatar Feb 04 '21 22:02 jsimic

Corey noted that since the OCR is stored in hOCR format ( https://en.wikipedia.org/wiki/HOCR ), editing the text directly in a large textbox would be tricky. A visual editor would be much easier to use.

Possible hOCR visual editors:

  • https://github.com/not-implemented/hocr-proofreader Demo: https://www.not-implemented.de/hocr-proofreader/
  • https://github.com/GeReV/hocr-editor
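
To illustrate why plain-textbox editing is awkward, here is a minimal sketch of how hOCR interleaves each word with layout data. The sample markup, the `HocrWordParser` class, and the `extract_words` helper are all hypothetical and written for illustration; they are not taken from OD2 or from either editor above. The sketch assumes the common convention of `ocrx_word` spans carrying a `bbox` property in their `title` attribute.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment: every word carries its own bounding box,
# so the editable text is buried inside layout markup.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 800 600'>
  <span class='ocr_line' title='bbox 36 92 300 116'>
    <span class='ocrx_word' id='word_1' title='bbox 36 92 96 116; x_wconf 95'>Oregon</span>
    <span class='ocrx_word' id='word_2' title='bbox 104 92 190 116; x_wconf 61'>Digitai</span>
  </span>
</div>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:5])
    return None


class HocrWordParser(HTMLParser):
    """Collects the id, bounding box, and text of each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word span currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_word":
            self._current = {
                "id": attrs.get("id"),
                "bbox": parse_bbox(attrs.get("title", "")),
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Assumes word spans contain no nested spans, as in the sample.
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


def extract_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for word in extract_words(SAMPLE_HOCR):
        print(word["id"], word["bbox"], word["text"])
```

Correcting "Digitai" means changing only the text inside one span while leaving its coordinates untouched, which is the job a visual editor handles on the user's behalf.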

wickr avatar Feb 09 '21 23:02 wickr

POSM has reviewed this and would like a list of user requirements from Metadeities to inform the selection of an editor.

jsimic avatar May 04 '21 19:05 jsimic

Metadeities discussed and would like the following:

  1. A side-by-side editor view showing the text in the context of the document alongside the OCR'd text, as in both example editors above
  2. OCR editing available to Reviewer-level users and above
  3. OCR text viewable in the editor by Depositor-level users
  4. Changes to OCR logged on the work, like other changes
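
Requirements 2 through 4 could be sketched roughly as follows. This is only an illustration under stated assumptions: the role ordering in `ROLE_RANK`, the `Work` class, and the `edit_ocr` function are hypothetical and do not reflect how OD actually models roles, works, or change logs.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Assumed role ordering, lowest to highest. The thread names these roles
# but does not define their ranking; this ordering is a guess.
ROLE_RANK = {"depositor": 1, "reviewer": 2, "curator": 3, "admin": 4}


def can_view_ocr(role):
    """Requirement 3: Depositor-level users and above can view OCR."""
    return ROLE_RANK.get(role, 0) >= ROLE_RANK["depositor"]


def can_edit_ocr(role):
    """Requirement 2: Reviewer-level users and above can edit OCR."""
    return ROLE_RANK.get(role, 0) >= ROLE_RANK["reviewer"]


@dataclass
class Work:
    """Hypothetical stand-in for a work with OCR text and a change log."""
    ocr_text: str
    change_log: list = field(default_factory=list)


def edit_ocr(work, user, role, new_text):
    """Apply an OCR edit and record it on the work (requirement 4)."""
    if not can_edit_ocr(role):
        raise PermissionError(f"role {role!r} may not edit OCR")
    work.change_log.append({
        "user": user,
        "field": "ocr_text",
        "old": work.ocr_text,
        "new": new_text,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    work.ocr_text = new_text


if __name__ == "__main__":
    work = Work(ocr_text="Oregon Digitai")
    edit_ocr(work, user="rev1", role="reviewer", new_text="Oregon Digital")
    print(work.ocr_text, len(work.change_log))
```

Keeping the old and new text in each log entry would let a curator review or revert individual corrections, which matters if the editor is later opened up to transcription or crowdsourcing projects.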

Contingent asks:

  1. If editing the document's text hierarchy and/or its blank elements is useful for supporting OD's accessibility features, we would like the editor to support those edits (see editor example 2 above)
  2. If OD ingests OCR completed outside of OD prior to ingest, that OCR needs to be editable as well

Question for Features:

  1. Does OD ingest OCR completed prior to ingest? Accessibility features added to PDFs before ingest can be robust, and we would like to preserve those features at ingest if we don't already.

KevinJonesMeta avatar Sep 05 '23 20:09 KevinJonesMeta