
Skipping OCR processing based on logical `mets:structMap`

Open mikegerber opened this issue 3 years ago • 8 comments

From my and @bertsky's discussion at https://github.com/qurator-spk/eynollah/issues/67:

Yes, it should be possible to skip pages marked as certain types in the logical structmap – not just in any one processor, but as a general mechanism for workflows in OCR-D.

For the concrete set of supported page types, we should stick to DFG Strukturdatenset, which is strangely missing colour_checker.

100% agree! Should we take this to an OCR-D core or spec issue? I have some additional thoughts to discuss (like: What happens with skipped pages in the output?)

  • It should be possible to skip pages with structMap types like spine or colour_checker. (@maria-federbusch supplied us at SBB with a list, I'll copy it in here.)
  • What should happen with skipped pages in the output? Empty PAGE or just omitted? What are the drawbacks of each approach?

mikegerber avatar Feb 22 '22 12:02 mikegerber

For the concrete set of supported page types, we should stick to DFG Strukturdatenset, which is strangely missing colour_checker.

That's a custom type used at SBB, invented by @maria-federbusch.

mikegerber avatar Feb 22 '22 14:02 mikegerber

missing colour_checker.

That's a custom type used at SBB, invented by @maria-federbusch.

I was surprised to see it in mets-mods2tei, but not in kitodo.presentation or dfg-viewer. Maybe you want to open a PR for that?

  • What should happen with skipped pages in the output? Empty PAGE or just omitted? What are the drawbacks of each approach?

again, see previous discussion

bertsky avatar Feb 22 '22 15:02 bertsky

One might think of an additional CLI option, say `-G, --page-type`, matching `mets:structMap[@TYPE="LOGICAL"]//mets:div/@TYPE` of pages in that range of the `mets:structLink` (if any), perhaps even with `//`-prefixed regexes.

But practically, there are too many positive cases to include, and only a few fixed negative ones: cover_front, cover_back, binding, spine, privileges, note.

So maybe we should just recommend ignoring all physical pages belonging to these page ranges in the implementation (and implement that behaviour for all Pythonic and bashlib processors in core)?
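The lookup described above (logical `mets:div/@TYPE`, then `mets:structLink`, then physical pages) can be sketched as follows. This is only a minimal illustration using Python's standard library, not an existing OCR-D API: the names `pages_to_skip`, `SKIP_TYPES` and `SAMPLE` are hypothetical, and the sample METS is a made-up, heavily reduced document.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK = "{http://www.w3.org/1999/xlink}"

# Negative list as named in the comment above (hypothetical constant name).
SKIP_TYPES = {"cover_front", "cover_back", "binding", "spine", "privileges", "note"}

# Made-up, reduced METS for illustration only.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:structMap TYPE="LOGICAL">
    <mets:div ID="LOG_0000" TYPE="monograph">
      <mets:div ID="LOG_0001" TYPE="cover_front"/>
      <mets:div ID="LOG_0002" TYPE="chapter"/>
      <mets:div ID="LOG_0003" TYPE="cover_back"/>
    </mets:div>
  </mets:structMap>
  <mets:structLink>
    <mets:smLink xlink:from="LOG_0000" xlink:to="PHYS_0001"/>
    <mets:smLink xlink:from="LOG_0000" xlink:to="PHYS_0002"/>
    <mets:smLink xlink:from="LOG_0000" xlink:to="PHYS_0003"/>
    <mets:smLink xlink:from="LOG_0001" xlink:to="PHYS_0001"/>
    <mets:smLink xlink:from="LOG_0002" xlink:to="PHYS_0002"/>
    <mets:smLink xlink:from="LOG_0003" xlink:to="PHYS_0003"/>
  </mets:structLink>
</mets:mets>"""


def pages_to_skip(mets_xml, skip_types=SKIP_TYPES):
    """Return the IDs of physical pages linked from logical divs of a skipped type."""
    root = ET.fromstring(mets_xml)
    # Step 1: collect IDs of logical divs whose TYPE is on the negative list.
    skip_ids = {
        div.get("ID")
        for div in root.iterfind('mets:structMap[@TYPE="LOGICAL"]//mets:div', NS)
        if div.get("TYPE") in skip_types
    }
    # Step 2: follow the structLink from those logical divs to physical pages.
    return {
        link.get(XLINK + "to")
        for link in root.iterfind("mets:structLink/mets:smLink", NS)
        if link.get(XLINK + "from") in skip_ids
    }


print(sorted(pages_to_skip(SAMPLE)))
```

In this sketch a page is skipped as soon as any logical div of a skipped type links to it, even if another logical div (such as the enclosing monograph) links to it as well. Whether that is the desired rule for pages shared between, say, a `note` and a chapter is one of the open questions here.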

bertsky avatar Aug 16 '22 14:08 bertsky

Additionally, I do use the information from physical containers. We often have custom-labeled containers like Leerseite ("blank page") or Colorchecker ( :slightly_smiling_face: ) in this area.

M3ssman avatar Oct 07 '22 06:10 M3ssman

If an image has been skipped due to a logical/physical mismatch, no FULLTEXT exists for it, and nothing is linked in the physical container either.

M3ssman avatar Oct 07 '22 07:10 M3ssman

@M3ssman on the OCR-D Forum you said that you have a workflow to do page selection based on logical structmap externally (independent of OCR-D) – could you elaborate here?

bertsky avatar Apr 14 '23 08:04 bertsky

It analyzes the METS and filters images by defined labels: logical ones from the DFG-structset, like cover_front and cover_back, and custom physical annotations like Colorchecker, Leerseite, Illustration and so on. For the latter, this information must already be present, for example because it has been added by your digitization colleagues. For rather small prints (<100 pages), this means saving 10% or more.

This relies on the fact that each page is processed afterwards in a separate ocrd-workspace. Those workspaces are created only for images that do not match a blacklisted label. Afterwards, only the existing OCR is enriched as FULLTEXT, leaving some pages empty. I have not experienced any drawbacks of this approach in the last half year.
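As a rough illustration of the selection step described above (not the actual implementation), the blacklist check could look like this. The names `select_pages` and `BLACKLIST` are hypothetical, and the input is assumed to be a mapping from page ID to all labels found for that page, logical and physical alike.

```python
# Hypothetical blacklist, built from the labels mentioned above (lower-cased).
BLACKLIST = {"cover_front", "cover_back", "colorchecker", "leerseite", "illustration"}


def select_pages(page_labels, blacklist=BLACKLIST):
    """Split pages into (to_process, skipped) by comparing labels to the blacklist."""
    to_process, skipped = [], []
    for page_id, labels in page_labels.items():
        # Normalize so that "Colorchecker " and "colorchecker" compare equal.
        normalized = {label.strip().lower() for label in labels}
        if normalized & blacklist:
            skipped.append(page_id)      # no workspace, hence no FULLTEXT later
        else:
            to_process.append(page_id)   # gets its own single-page workspace
    return to_process, skipped


pages = {
    "PHYS_0001": ["cover_front"],
    "PHYS_0002": ["chapter"],
    "PHYS_0003": ["chapter", "Leerseite"],
    "PHYS_0004": ["Colorchecker "],
}
todo, skipped = select_pages(pages)
print(todo, skipped)
```

Only the pages in `todo` would get a workspace. The pages in `skipped` stay in the METS without OCR, which corresponds to the "some pages empty" outcome mentioned above.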

This also works when creating new PDFs from the resulting ALTO data using the derivans tool.

To extend this to complete ocrd-workspaces, one could probably even combine it with lazy loading, so that these images are not downloaded locally.

M3ssman avatar Apr 14 '23 10:04 M3ssman

ok, so in principle it's clear that if you use the split recipe (dividing up the METS into single-page workspaces to be processed in parallel), then it is easy to filter by logical page type. (Still, I was hoping for some concrete technical details.)

Getting back to the question of how to do this with OCR-D: @kba, can you please weigh in (esp. on whether we should do this with positive/negative filters on a new CLI option, or rather by implicit filtering in core, perhaps even configurable...)?

bertsky avatar Apr 14 '23 13:04 bertsky