OCR and Annotations in "Basic Newspaper": How to tell they're the same text?

Open jbaiter opened this issue 2 years ago • 3 comments

I'm currently running into a minor issue with the way the OCR is referenced in the "Basic Newspaper" recipe.

The OCR text is provided twice. First, as an ALTO XML resource referenced in the Canvas' rendering property. Second, as individual line annotations in the Canvas' annotations at https://iiif.io/api/cookbook/recipe/0068-newspaper/newspaper_issue_1-anno_p1.json.

The issue arises when a generic "Content Search API" indexer that supports both OCR and annotations tries to index this canvas. Since nothing in the annotations indicates that they contain the same text as the OCR, both will be indexed, and users will get duplicate results when they run a content search on the canvas.
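To illustrate, here is a rough sketch of what such a generic indexer might do when it collects text sources for a canvas. The canvas dict is a simplified stand-in, not the recipe's actual JSON: the canvas id and the ALTO URL are placeholders, `collect_text_sources` is a hypothetical helper, and detecting ALTO via the `profile` value is just one assumption about how an indexer could recognize OCR.

```python
# Assumption: the indexer recognizes ALTO renderings by their profile URI.
ALTO_PROFILE = "http://www.loc.gov/standards/alto/"

# Simplified, hypothetical canvas; only the annotation page URL is from the recipe.
canvas = {
    "id": "https://example.org/canvas/p1",  # placeholder
    "type": "Canvas",
    "rendering": [
        {
            "id": "https://example.org/ocr/p1.xml",  # placeholder ALTO URL
            "type": "Text",
            "format": "application/xml",
            "profile": ALTO_PROFILE,
        }
    ],
    "annotations": [
        {
            "id": "https://iiif.io/api/cookbook/recipe/0068-newspaper/newspaper_issue_1-anno_p1.json",
            "type": "AnnotationPage",
        }
    ],
}


def collect_text_sources(canvas):
    """Return (kind, url) pairs for every text source the indexer would fetch."""
    sources = []
    # OCR files linked from the canvas' rendering property
    for rendering in canvas.get("rendering", []):
        if rendering.get("profile", "").startswith(ALTO_PROFILE):
            sources.append(("ocr", rendering["id"]))
    # Annotation pages linked from the canvas' annotations property
    for page in canvas.get("annotations", []):
        if page.get("type") == "AnnotationPage":
            sources.append(("annotations", page["id"]))
    return sources


sources = collect_text_sources(canvas)
for kind, url in sources:
    print(kind, url)
```

Both entries come back as independent sources. Nothing in the structure lets the indexer conclude that the second one repeats the text of the first, so both end up in the index.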

Is there a way to make it clearer that the annotations are "the page content as text" (iirc there was a cnt:ContentAsText in IIIFv2?) so indexers can check for it?

Oct 06 '23 13:10 jbaiter