Switch new Internet Archive Integration to use the IIIF interface

Open saracarl opened this issue 7 years ago • 6 comments

We should keep the UI for starting a project by entering the IA work URL, but under the covers we should derive the IIIF manifest ID for that work from the URL.
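A minimal sketch of that derivation, in Python for illustration only. It assumes work URLs have the form `https://archive.org/details/<identifier>` and that manifests live at a URL like the `MANIFEST_TEMPLATE` below; both the template and the function names are assumptions to check against the Internet Archive's IIIF documentation, not a confirmed API.

```python
import re
from urllib.parse import urlparse

# Assumption: illustrative manifest URL pattern; verify against the
# Internet Archive's IIIF documentation before relying on it.
MANIFEST_TEMPLATE = "https://iiif.archive.org/iiif/{identifier}/manifest.json"


def ia_identifier(work_url: str) -> str:
    """Extract the item identifier from an archive.org work URL."""
    parsed = urlparse(work_url.strip())
    host = (parsed.hostname or "").lower()
    if host != "archive.org" and not host.endswith(".archive.org"):
        raise ValueError(f"not an archive.org URL: {work_url!r}")
    # Work pages look like /details/<identifier>, optionally followed
    # by extra path segments such as /page/n5.
    match = re.match(r"^/details/([^/]+)", parsed.path)
    if match is None:
        raise ValueError(f"no item identifier in URL: {work_url!r}")
    return match.group(1)


def manifest_url(work_url: str) -> str:
    """Build the (assumed) IIIF manifest URL for an IA work URL."""
    return MANIFEST_TEMPLATE.format(identifier=ia_identifier(work_url))
```

The user would still paste the familiar work URL; only the value handed to the IIIF importer changes.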

Open question: For BHL work, are the leaf numbers in the IIIF manifest canvas data? If not, this needs to be rethought.

saracarl avatar Oct 20 '17 14:10 saracarl

I can create a separate issue if preferred, but when this is done, the wiki page on importing from IA should also be updated.

bencomp avatar May 04 '18 13:05 bencomp

Thanks for the comment. Updating the wiki page belongs in this issue.

benwbrum avatar May 04 '18 21:05 benwbrum

When we do get around to doing this, we will need to deal with the existing integration points that pull OCR resources from the Internet Archive during an import. There is a parallel in the integration work we did for CONTENTdm, in which:

  1. A user sees a clear "Import from Internet Archive" flow they can use to cut-and-paste an IA URL directly, so that they do not need to determine the IIIF manifest URI from the Internet Archive website. This flow derives the IIIF manifest and redirects them to the IIIF manifest importer.
  2. The IIIF manifest importer recognizes the archive.org URI in the manifest and offers an OCR import flow.
  3. If OCR import is checked, the IIIF import runs first, then a new post-processing job is launched to import the OCR text.
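
Steps 2 and 3 could be sketched roughly as follows, in Python for illustration only. The job names `iiif_import` and `ocr_import` and both function names are hypothetical, and the host check is an assumption about where IA manifests are served. The manifest identifier is read from `id` (IIIF Presentation 3) or `@id` (Presentation 2).

```python
from urllib.parse import urlparse


def is_internet_archive_manifest(manifest: dict) -> bool:
    """Return True if the manifest's identifier is hosted under archive.org."""
    # Presentation 3 uses "id"; Presentation 2 uses "@id".
    manifest_id = manifest.get("id") or manifest.get("@id") or ""
    host = (urlparse(manifest_id).hostname or "").lower()
    # Assumption: IA manifests are served from archive.org or a subdomain.
    return host == "archive.org" or host.endswith(".archive.org")


def plan_import_jobs(manifest: dict, import_ocr: bool) -> list[str]:
    """List the jobs to run, in order, for one manifest import."""
    jobs = ["iiif_import"]  # hypothetical job name
    # The OCR job is a post-processing step, so it is queued after the
    # IIIF import and only when the user ticked the OCR option.
    if import_ocr and is_internet_archive_manifest(manifest):
        jobs.append("ocr_import")  # hypothetical job name
    return jobs
```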

benwbrum avatar May 04 '18 22:05 benwbrum

We should be able to do this by more or less duplicating the CONTENTdm import workflow and UI, including the OCR import.

saracarl avatar May 07 '18 13:05 saracarl

If we wait for the IA IIIF implementation to support OCR text as annotations, we will have feature parity without building extra code and won't have to follow the CONTENTdm model.

benwbrum avatar Oct 16 '23 23:10 benwbrum

The IA IIIF implementation now supports OCR text as annotations. However, these are exposed at the word level rather than the page level, so we may need new code to parse those structures into lines.
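
One possible shape for that parsing, sketched in Python for illustration only. It assumes each word annotation has already been reduced to its text plus a target URI ending in an `#xywh=x,y,w,h` fragment; the actual structure of the IA annotations is not confirmed here. The grouping rule (a word joins a line when its vertical centre falls within that line's vertical extent) is a simple heuristic and would likely need tuning for skewed or multi-column pages.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: int
    y: int
    w: int
    h: int


# Matches a trailing media-fragment region such as "#xywh=10,100,50,20".
XYWH = re.compile(r"#xywh=(\d+),(\d+),(\d+),(\d+)$")


def parse_word(text: str, target: str) -> Word:
    """Build a Word from annotation text and its target URI."""
    match = XYWH.search(target)
    if match is None:
        raise ValueError(f"target has no xywh fragment: {target!r}")
    x, y, w, h = map(int, match.groups())
    return Word(text, x, y, w, h)


def group_into_lines(words: list[Word]) -> list[str]:
    """Group words into text lines, top to bottom, left to right."""
    lines: list[list[Word]] = []
    for word in sorted(words, key=lambda item: (item.y, item.x)):
        centre = word.y + word.h / 2
        for line in lines:
            top = min(item.y for item in line)
            bottom = max(item.y + item.h for item in line)
            if top <= centre <= bottom:
                line.append(word)
                break
        else:
            # No existing line overlaps this word, so start a new one.
            lines.append([word])
    lines.sort(key=lambda line: min(item.y for item in line))
    return [
        " ".join(item.text for item in sorted(line, key=lambda item: item.x))
        for line in lines
    ]
```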

benwbrum avatar Sep 01 '24 14:09 benwbrum