cookbook-recipes
OCR and Annotations in "Basic Newspaper": How to tell they're the same text?
I'm currently encountering a minor issue with the way the OCR is referenced in the "Basic Newspaper" recipe. On the one hand, it's provided as an ALTO XML resource referenced in the `rendering` property. On the other, it's provided as individual line annotations in the Canvas' `annotations` at https://iiif.io/api/cookbook/recipe/0068-newspaper/newspaper_issue_1-anno_p1.json.
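
For context, here is a trimmed sketch of how the two references sit side by side on the Canvas. Only the annotation page URL is the real one from above; the Canvas `id` and the `rendering` entry's `id`, `type`, `format`, and `profile` are illustrative guesses, not copied from the recipe:

```json
{
  "id": "https://example.org/newspaper/canvas/p1",
  "type": "Canvas",
  "rendering": [
    {
      "id": "https://example.org/newspaper/page1.alto.xml",
      "type": "Dataset",
      "label": { "en": [ "ALTO XML OCR" ] },
      "format": "application/xml",
      "profile": "http://www.loc.gov/standards/alto/"
    }
  ],
  "annotations": [
    {
      "id": "https://iiif.io/api/cookbook/recipe/0068-newspaper/newspaper_issue_1-anno_p1.json",
      "type": "AnnotationPage"
    }
  ]
}
```

Nothing in this structure connects the `rendering` resource to the annotation page, which is exactly the problem described below.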
Now the issue arises when a generic "Content Search API" indexer that supports both OCR and annotations tries to index this canvas. Since the annotations in no way make clear that they contain the same text as the OCR, both will be indexed, and as a result users will get duplicate results for a content search on the canvas.
Is there a way to make it clearer that the annotations are "the page content as text" (IIRC there was a `cnt:ContentAsText` type for this in IIIF v2?) so indexers can check for it?
Can you use the fact that the annotations have `"motivation": "supplementing"`, or is that not specific enough? There is a new motivation TSG being formed that might coin a `transcription` motivation. Would that solve the issue?
Do we need some link between the annotations and the ALTO to say they are different formats of the same text?
We could add a label to the annotation page to say it's OCR data, and then could your interface let the user choose which one they want?
> Can you use the fact that the annotations have `"motivation": "supplementing"`, or is that not specific enough? There is a new motivation TSG being formed that might coin a `transcription` motivation. Would that solve the issue?
I'm afraid `supplementing` is not specific enough, since a `supplementing` annotation could also be e.g. a translation of the text on the canvas (if I understood the spec correctly). A `transcription` motivation would indeed solve the issue, since I could simply ignore these annotations in the presence of an OCR `rendering`.
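
To illustrate the deduplication logic an indexer could then apply, here is a minimal Python sketch. It assumes the hypothetical `transcription` motivation discussed above, and the ALTO detection (matching on `format`/`profile`) is a hedged guess, not the recipe's actual values:

```python
# Sketch: skip transcription annotations when the canvas already links
# full-page OCR via `rendering`, to avoid indexing the same text twice.
# The `transcription` motivation is hypothetical (pending the motivation TSG).

def has_alto_rendering(canvas: dict) -> bool:
    """True if the canvas links OCR as an ALTO resource in `rendering`."""
    for r in canvas.get("rendering", []):
        # Heuristic match; real manifests may declare ALTO differently.
        if "alto" in r.get("profile", "").lower() or "alto" in r.get("format", "").lower():
            return True
    return False

def annotations_to_index(canvas: dict) -> list:
    """Collect text annotations, dropping transcriptions already covered by ALTO."""
    skip_transcriptions = has_alto_rendering(canvas)
    selected = []
    for page in canvas.get("annotations", []):
        for anno in page.get("items", []):
            if skip_transcriptions and anno.get("motivation") == "transcription":
                continue  # same text as the ALTO; would produce duplicate hits
            selected.append(anno)
    return selected
```

With only `supplementing` available, the `continue` branch cannot fire safely, since it might drop a translation instead of a transcription.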
👍🏾
> Do we need some link between the annotations and the ALTO to say they are different formats of the same text?
I think the `transcription` motivation would probably be enough; something more advanced like this sounds like it could cause a lot more headaches than a simple motivation 😅
> We could add a label to the annotation page to say it's OCR data, and then could your interface let the user choose which one they want?
In my use case, no, since the indexing is a fully automatic process without user interaction. And selecting between different indices at query time is AFAIK not supported by the Content Search API (except for the `motivation` query parameter).
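
For what it's worth, that `motivation` parameter is the only query-time filter; a request against an imagined search service would look roughly like this (the service URL and query values are made up for illustration):

```python
# Sketch of a Content Search query URL. The `q` and `motivation` parameters
# come from the Content Search API; the service endpoint is hypothetical.
from urllib.parse import urlencode

base = "https://example.org/iiif/newspaper_issue_1/search"  # hypothetical service
params = {"q": "amsterdam", "motivation": "painting"}
url = f"{base}?{urlencode(params)}"
print(url)
```

Since `motivation` filters annotations rather than selecting an index, it cannot distinguish two indices that both serve `supplementing` text.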