fromthepage
Switch new Internet Archive Integration to use the IIIF interface
We should keep the UI for starting a project by entering the IA work URL, but under the covers we should derive the IIIF manifest ID for that work from the URL.
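As a rough illustration of the "derive" step, here is a minimal Python sketch (not FromThePage code). It assumes IA work URLs carry the item identifier as the second path segment (for example `/details/<identifier>`) and that the manifest lives at the pattern in `MANIFEST_TEMPLATE`; both the accepted path prefixes and the template are assumptions to verify against the Internet Archive's IIIF documentation. The function names are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: illustrative manifest URL pattern; confirm against IA's IIIF docs.
MANIFEST_TEMPLATE = "https://iiif.archive.org/iiif/{identifier}/manifest.json"

# Assumption: path prefixes of IA work URLs that are followed by the identifier.
IA_PATH_PREFIXES = ("details", "stream", "download")


def ia_identifier(work_url: str) -> str:
    """Extract the item identifier from an Internet Archive work URL."""
    parsed = urlparse(work_url.strip())
    host = (parsed.hostname or "").lower()
    if host != "archive.org" and not host.endswith(".archive.org"):
        raise ValueError(f"not an Internet Archive URL: {work_url!r}")
    # Drop empty segments produced by leading or trailing slashes.
    segments = [s for s in parsed.path.split("/") if s]
    if len(segments) < 2 or segments[0] not in IA_PATH_PREFIXES:
        raise ValueError(f"no item identifier found in: {work_url!r}")
    return segments[1]


def derive_manifest_url(work_url: str, template: str = MANIFEST_TEMPLATE) -> str:
    """Build the IIIF manifest URL for the work behind an IA URL."""
    return template.format(identifier=ia_identifier(work_url))
```

Keeping the template as a parameter means the UI can stay the same even if the manifest location changes.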
Open question: For BHL work, are the leaf numbers in the IIIF manifest canvas data? If not, this needs to be rethought.
I can create a separate issue if preferred, but when this is done, the wiki page on importing from IA should also be updated.
Thanks for the comment. Updating the wiki page belongs in this issue.
When we do get around to doing this, we will need to deal with the existing integration points that pull OCR resources from the Internet Archive during an import. There is a parallel in the integration work we did for CONTENTdm, in which:
- A user sees a clear "Import from Internet Archive" flow they can use to cut-and-paste an IA URL directly, so that they do not need to determine the IIIF manifest URI from the Internet Archive website. This flow derives the IIIF manifest and redirects them to the IIIF manifest importer.
- The IIIF manifest importer recognizes the `archive.org` URI in the manifest and offers an OCR import flow.
- If OCR import is checked, the IIIF import is done, then a new post-processing job is launched to import OCR text.
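The recognition step in the flow above could be sketched roughly as follows. This is a hypothetical Python illustration, not the existing importer: it walks a parsed manifest and reports whether any string value is a URI on `archive.org` or one of its subdomains, which is the signal for offering the OCR import option. The function names are made up for the example, and the sketch makes no assumption about which manifest field holds the URI.

```python
from urllib.parse import urlparse


def iter_strings(node):
    """Yield every string value found anywhere in a parsed JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def is_archive_org_uri(value: str) -> bool:
    """True if the string parses as a URI hosted on archive.org or a subdomain."""
    try:
        host = (urlparse(value).hostname or "").lower()
    except ValueError:
        # Malformed URI-like strings are treated as non-matching.
        return False
    return host == "archive.org" or host.endswith(".archive.org")


def offers_ia_ocr_import(manifest: dict) -> bool:
    """Decide whether to show the OCR import option for this manifest."""
    return any(is_archive_org_uri(s) for s in iter_strings(manifest))
```

Matching on the parsed hostname rather than a substring avoids false positives such as `notarchive.org` or labels that merely mention the site.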
We should be able to do this by more-or-less duplicating the CONTENTdm import workflow and UI, including the OCR import.
If we wait for the IA IIIF implementation to support OCR text as annotations, we will have feature parity without building extra code and won't have to follow the CONTENTdm model.
The IA IIIF implementation now supports OCR text as annotations. However, these are exposed at the word level rather than the page level, so we may need new code to parse those structures into lines.
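One possible shape for that parsing code, as a hedged Python sketch: it assumes each word annotation has already been reduced to its text plus a canvas target ending in a `#xywh=x,y,w,h` fragment with integer pixel values. How the IA annotations are actually structured is not confirmed here, and `Word`, `word_from_target`, and `group_words_into_lines` are hypothetical names. Words are grouped into a line when their vertical centers are close, then ordered left to right.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: int
    y: int
    w: int
    h: int

    @property
    def center_y(self) -> float:
        return self.y + self.h / 2


def word_from_target(text: str, target: str) -> Word:
    """Build a Word from its text and a target URI ending in '#xywh=x,y,w,h'."""
    _, _, fragment = target.partition("#xywh=")
    x, y, w, h = (int(v) for v in fragment.split(","))
    return Word(text, x, y, w, h)


def group_words_into_lines(words: list[Word], tolerance: float = 0.5) -> list[str]:
    """Cluster words into text lines by vertical position (a simple heuristic)."""
    lines: list[list[Word]] = []
    for word in sorted(words, key=lambda w: w.center_y):
        if lines:
            current = lines[-1]
            line_center = sum(w.center_y for w in current) / len(current)
            line_height = sum(w.h for w in current) / len(current)
            # Same line if the word's center is within a fraction of the line height.
            if abs(word.center_y - line_center) <= tolerance * line_height:
                current.append(word)
                continue
        lines.append([word])
    # Within each line, read words left to right.
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines]
```

This heuristic ignores skewed scans and multi-column layouts, so real pages would likely need something more robust.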