Skipping OCR processing based on logical `mets:structMap`
From my and @bertsky's discussion at https://github.com/qurator-spk/eynollah/issues/67:
Yes, it should be possible to skip pages marked as certain types in the logical structmap – not just in any one processor, but as a general mechanism for workflows in OCR-D.
For the concrete set of supported page types, we should stick to DFG Strukturdatenset, which is strangely missing colour_checker.
100% agree! Should we take this to an OCR-D core or spec issue? I have some additional thoughts to discuss (like: What happens with skipped pages in the output?)
- It should be possible to skip pages with `structMap` types like `spine` or `colour_checker`. (@maria-federbusch supplied us at SBB with a list, I'll copy it in here.)
- What should happen with skipped pages in the output? Empty PAGE or just omitted? What are the drawbacks of each approach?
For the concrete set of supported page types, we should stick to DFG Strukturdatenset, which is strangely missing colour_checker.
That's a custom type used at SBB, invented by @maria-federbusch.
I was surprised to see it in mets-mods2tei, but not in kitodo.presentation or dfg-viewer. Maybe you want to open a PR for that?
- What should happen with skipped pages in the output? Empty PAGE or just omitted? What are the drawbacks of each approach?
again, see previous discussion
One might think of an additional CLI option, say -G, --page-type, matching mets:structMap[@TYPE="LOGICAL"]//mets:div/@TYPE of pages in that range of the mets:structLink (if any), perhaps even with //-prefixed regexes.
But practically, there are too many positive cases to include, and only a few fixed negative ones: cover_front, cover_back, binding, spine, privileges, note.
So maybe we should just recommend ignoring all physical pages belonging to these page ranges in the implementation (and implement that behaviour for all Pythonic and bashlib processors in core)?
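To make the implicit-filtering idea concrete, here is a minimal sketch of how the physical pages belonging to those negative types could be resolved from the METS. It is not an existing OCR-D core API: the function name `pages_to_skip`, the constant `SKIP_TYPES` and the sample METS are hypothetical, and it assumes that logical divs are linked to physical pages through `mets:smLink` entries with `xlink:from` and `xlink:to`. Nested divs below a skipped div are not followed.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK = "{http://www.w3.org/1999/xlink}"

# The fixed negative cases named above (assumed to be the complete set).
SKIP_TYPES = {"cover_front", "cover_back", "binding", "spine", "privileges", "note"}


def pages_to_skip(mets_xml, skip_types=SKIP_TYPES):
    """Return the IDs of physical pages linked to logical divs of a skipped type."""
    root = ET.fromstring(mets_xml)
    # IDs of all logical divs whose @TYPE is one of the skipped types
    skipped_logical = {
        div.get("ID")
        for div in root.iterfind('.//mets:structMap[@TYPE="LOGICAL"]//mets:div', NS)
        if div.get("TYPE") in skip_types
    }
    # Follow the structLink from those logical divs to the physical pages
    return {
        link.get(XLINK + "to")
        for link in root.iterfind(".//mets:structLink/mets:smLink", NS)
        if link.get(XLINK + "from") in skipped_logical
    }


# Hypothetical minimal METS, for illustration only
SAMPLE = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:structMap TYPE="LOGICAL">
    <mets:div ID="LOG_0000" TYPE="monograph">
      <mets:div ID="LOG_0001" TYPE="cover_front"/>
      <mets:div ID="LOG_0002" TYPE="chapter"/>
      <mets:div ID="LOG_0003" TYPE="cover_back"/>
    </mets:div>
  </mets:structMap>
  <mets:structLink>
    <mets:smLink xlink:from="LOG_0000" xlink:to="PHYS_0001"/>
    <mets:smLink xlink:from="LOG_0000" xlink:to="PHYS_0002"/>
    <mets:smLink xlink:from="LOG_0000" xlink:to="PHYS_0003"/>
    <mets:smLink xlink:from="LOG_0001" xlink:to="PHYS_0001"/>
    <mets:smLink xlink:from="LOG_0002" xlink:to="PHYS_0002"/>
    <mets:smLink xlink:from="LOG_0003" xlink:to="PHYS_0003"/>
  </mets:structLink>
</mets:mets>
"""

print(sorted(pages_to_skip(SAMPLE)))
```

Because the top-level `monograph` div is linked to every page, the sketch only collects pages that are linked from at least one skipped div, rather than requiring that all of a page's logical types be skipped. Whether that is the right rule is part of the open question here.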
Additionally, I do use the information from physical containers.
We often have custom-labelled containers like Leerseite or Colorchecker ( :slightly_smiling_face: ) in this area.
If an image has been skipped due to a logical / physical mismatch, no FULLTEXT exists for it, and nothing is linked in the physical container either.
@M3ssman on the OCR-D Forum you said that you have a workflow to do page selection based on logical structmap externally (independent of OCR-D) – could you elaborate here?
It analyzes the METS and filters images by defined labels, both logical ones from the DFG structure set like cover_front and cover_back and custom physical annotations like Colorchecker, Leerseite, Illustration and so on. For the latter, this information must already be present, for example because your digitization colleagues have added it.
For rather small prints (<100 pages) this means saving 10% or more.
This relies on the fact that each page is processed afterwards in a separate ocrd-workspace. Those workspaces are only created for images that do not match a blacklisted label. Afterwards, only the existing OCR is enriched as FULLTEXT, leaving some pages empty. I have not experienced any drawbacks with this approach in the last half year.
This also works when creating new PDFs from the resulting ALTO data with the derivans tool.
To extend this to complete ocrd-workspaces, one could probably even combine it with lazy loading, so that these images are never downloaded locally.
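The external filtering described in this comment might look roughly like the following sketch. It is an illustration under assumptions, not the actual implementation: the names `BLACKLIST`, `split_pages` and `merge_fulltext` are hypothetical, the page labels are assumed to have been extracted from the METS already, and the OCR step is replaced by a stand-in.

```python
# Labels to skip: logical DFG types plus the custom physical annotations
# named in the comment. The exact set is an assumption.
BLACKLIST = {"cover_front", "cover_back", "Colorchecker", "Leerseite", "Illustration"}


def split_pages(page_labels, blacklist=BLACKLIST):
    """Partition page IDs into (to_process, skipped) according to their labels."""
    to_process, skipped = [], []
    for page_id, labels in page_labels.items():
        # A page is skipped as soon as one of its labels is blacklisted
        target = skipped if blacklist & set(labels) else to_process
        target.append(page_id)
    return to_process, skipped


def merge_fulltext(page_ids, ocr_results):
    """Map every page to its OCR result, or None where the page was skipped."""
    return {page_id: ocr_results.get(page_id) for page_id in page_ids}


# Hypothetical page labels, as they might be collected from both structMaps
pages = {
    "PHYS_0001": ["cover_front"],
    "PHYS_0002": ["Colorchecker"],
    "PHYS_0003": ["title_page"],
    "PHYS_0004": ["chapter"],
    "PHYS_0005": ["chapter", "Leerseite"],
}

to_process, skipped = split_pages(pages)
# Stand-in for running OCR in one single-page workspace per remaining page
ocr = {page_id: f"{page_id}.xml" for page_id in to_process}
print(merge_fulltext(pages, ocr))
```

Only the pages in `to_process` would get a workspace, and the skipped pages stay in the final mapping with no FULLTEXT, which matches the "leaving some pages empty" behaviour described above.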
ok, so in principle it's clear that if you use the split recipe (dividing up the METS into single-page workspaces to be processed in parallel), then it is easy to filter by logical page type. (Still, I was hoping for some concrete technical details.)
Getting back to the question of how to do this with OCR-D: @kba, can you please weigh in (esp. on whether we should do this with positive/negative filters on a new CLI option, or rather by implicit filtering in core, perhaps even configurable...)?