additional CLI/METS/ocrd-tool.json specs for quality estimator tools and annotations
We all know we need some form of quality estimates to control where computation is spent and what workflow steps are used.
External quality control
One might consider this a problem external to workflow configuration and module implementation. In that interpretation, it remains the workflow engine's job to test results, and to enter alternative pre-defined workflow paths – or give up on a workspace – when they fail.
So the user could still influence computation by providing configurations with switches/conditionals. But that influence is rather limited – you could not say where to check, how, or with which models.
Also, module implementers cannot contribute their expert knowledge about how to get good quality estimates.
OCR-D quality control
Alternatively, one might want to model these tests explicitly, defining and managing specialised tools for testing, and configuring their usage in workflows along with the processors themselves.
So – at this point I am repeating a proposal made in discussing #171 – …
How about introducing a dedicated CLI for OCR-D workflow quality estimators, analogous to OCR-D workflow processors? Modules could bundle their knowledge about what a good result is for a particular processor along with everything else. And specialized modules could provide QE tools by the bunch. Let's call them ~~benchmarks~~ evaluators for the time being. We are primarily interested in an ~~benchmark's~~ evaluator's score, which needs to be compared to some threshold to determine whether the processor's result was "okay" enough to continue processing. That threshold could differ depending on the workflow configuration. ~~Benchmarks~~ Evaluators could also have configurable parameters (like model files) of their own.
An important question is how to deal with page-wise vs document-wise quality. Do we want to stop if even a single page indicates failure? Or does it take a bad average? Or a prolonged series of bad pages? Or a min-worst-page / min-average kind of setup?
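For illustration, here is a minimal sketch of document-level aggregation, assuming (purely hypothetically) that an evaluator can emit one page identifier and score per line into a report file; the file name and both thresholds are made up:

```sh
# Fail the check if the worst page falls below min_worst or the document
# average falls below min_avg (two of the aggregation policies named above).
awk -v min_worst=0.5 -v min_avg=0.8 '
    NR == 1 || $2 < worst { worst = $2 }   # track the worst page score (2nd column)
    { sum += $2; n += 1 }
    END {
        avg = (n ? sum / n : 0)
        exit !(worst >= min_worst && avg >= min_avg)   # exit 0 = pass, 1 = fail
    }' page-scores.txt
```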
Regardless, besides the binary score > threshold check, we might also be interested in additional, verbose output on what was analysed and what patterns were found – either as part of logging, or as a structured report. So we must also allow creating an output fileGrp.
Now let's assume a call to a ~~benchmark~~ evaluator looks like this:
```sh
ocrdtest-cis-ocropy-binarization -I OCR-D-BIN-WOLF -O OCR-D-BIN-WOLF-TEST-OCRO -P method component-statistics -P threshold 0.9
```
(with its return value indicating success or failure, just as the return value of a processor).
Then we could write dynamic workflows like so:
```sh
if   BIN=wolf
     BINARIZED=OCR-D-BIN-$BIN
     ocrd-olena-binarize -I OCR-D-IMG -O $BINARIZED -P impl $BIN
     ocrdtest-cis-ocropy-binarization -I $BINARIZED -O $BINARIZED-TEST-OCRO -P method component-statistics -P threshold 0.9
then :
elif BIN=ocropy
     # try another algorithm
     BINARIZED=OCR-D-BIN-$BIN
     ocrd-cis-ocropy-binarize -I OCR-D-IMG -O $BINARIZED -P method $BIN -P level-of-operation page
     ocrdtest-cis-ocropy-binarization -I $BINARIZED -O $BINARIZED-TEST-OCRO -P method component-statistics -P threshold 0.9
then :
elif BIN=wolf
     # try another parameter
     BINARIZED=OCR-D-BIN-$BIN-HIGH
     ocrd-olena-binarize -I OCR-D-IMG -O $BINARIZED -P impl $BIN -P k 0.1
     ocrdtest-cis-ocropy-binarization -I $BINARIZED -O $BINARIZED-TEST-OCRO -P method component-statistics -P threshold 0.9
then :
else
     # give up the workflow
     exit 1
fi
# without extra evaluator (only completion)
ocrd-tesserocr-deskew -I $BINARIZED -O OCR-D-DESK -P operation_level page
# try different cropping tools with successively lower expectations
if   CROPPED=OCR-D-CROP-TESS
     ocrd-tesserocr-crop -I OCR-D-DESK -O $CROPPED
     ocrdtest-leptonica-cropping -I $CROPPED -P threshold 0.7
then :
elif CROPPED=OCR-D-CROP-ANY
     ocrd-anybaseocr-crop -I OCR-D-DESK -O $CROPPED
     ocrdtest-leptonica-cropping -I $CROPPED -P threshold 0.5
then :
else
     # omit cropping
     CROPPED=OCR-D-DESK
fi
ocrd-tesserocr-segment-region -I $CROPPED -O OCR-D-SEG-TESS
# and so on...
```
(borrowing the sh -e convention that any non-zero return value causes the workflow to fail, except when it occurs directly within a conditional expression)
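To make that borrowed convention concrete, a small sketch (file-group names reused from above, the ocrdtest-* tool still hypothetical):

```sh
set -e   # any unchecked non-zero exit status aborts the whole workflow
# failure of a plain command ends the run immediately:
ocrd-tesserocr-deskew -I OCR-D-BIN-WOLF -O OCR-D-DESK -P operation_level page
# ...whereas commands inside an if/elif condition list are "checked" – their
# failure merely selects the next branch instead of aborting:
if ocrdtest-leptonica-cropping -I OCR-D-CROP-TESS -P threshold 0.7
then echo "cropping accepted"
else echo "cropping rejected, trying a fallback" >&2
fi
```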
That would be fully dynamic, but it would still not allow arbitrary information flow, such as determining model names. For the latter, some notion of a functional CLI would be needed...
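Just to sketch what such a functional CLI could look like: everything below is hypothetical – the ocrdtest-* tool, its parameters, and the convention that it prints its recommendation to stdout; the recognition call merely stands in for any model-dependent step.

```sh
# A hypothetical evaluator that writes a recommendation to stdout instead of
# (only) returning an exit status; the workflow captures it via command
# substitution and feeds it into the next processor's parameters.
MODEL=$(ocrdtest-typegroups-classify -I $BINARIZED -P top 1)   # assumed to print e.g. "Fraktur"
ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR -P model "$MODEL"
```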
Originally posted by @bertsky in https://github.com/OCR-D/spec/pull/171/review_comment/create
EDIT: term ~~benchmark~~ → evaluator
From a knowledge archaeology point of view it is good enough to have at least some OCR results, rather than losing a whole document with several thousand pages. One could even argue that for really big pages (newspapers, maps, 2°-prints) it is preferable to save parts of a single page that were recognized well. Anyway, future OCR is expected to perform better than today's, so there will be an improvement over time.
I really like the idea of alternative workflows, but I'm afraid the Wizardry of Workflow Configuration (WWC) will grow in a manner unintelligible to common mortals like me.
The first step, IMHO, is to extend the core CLI in such a way that each processor reports its outcome in a unified way, so that core can decide what to do next, i.e.
- skip the current workflow entity, which is physically represented by a whole page (binarization), a page region (segmentation) or even just a single line - just by not integrating it into the current filegroup
- if no input is left for the following processor, log this and exit gracefully
- try a different processor (configured or based on different heuristics)
> One could even argue that for really big pages (newspapers, maps, 2°-prints) it is preferable to save parts of a single page that were recognized well.
Yes, that's radical, but doable IMO. We could strive for line-level result aggregation. We would need some way of marking partial failure (or "rejection") in our annotations – e.g. when layout segmentation fails to detect a text region, or fails to meet a score across parts of the page – and optionally acting on it – e.g. by (removing the failed segments and) running another segmentation processor on top of the partial result (which works as long as processors annotate incrementally). But if we allow partial failure, then we cannot use binary CLI interfaces anymore (where a processor or test either succeeds or fails)...
Some people have proposed standardizing log messages and then parsing those instead, but I would not recommend that approach at all. How about standardizing the processor/evaluator CLIs' exit codes, so we can differentiate between all-out success (0), all-out failure (1) and partial failure (2), perhaps even temporary failure (3) (e.g. if the GPU ran out of memory)?
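A workflow step could then react to these codes without any log parsing – roughly like this (the segmenter and the reactions are just placeholders):

```sh
status=0
ocrd-cis-ocropy-segment -I OCR-D-BIN -O OCR-D-SEG || status=$?   # keep going even under set -e
case $status in
    0) ;;                                                          # all-out success: continue
    2) echo "partial failure, scheduling a fallback run" >&2 ;;    # e.g. re-segment the rejected pages
    3) echo "temporary failure, re-queueing this workspace" >&2 ;; # e.g. wait for a free GPU
    *) exit 1 ;;                                                   # all-out failure: give up
esac
```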
> I really like the idea of alternative workflows, but I'm afraid the Wizardry of Workflow Configuration (WWC) will grow in a manner unintelligible to common mortals like me.
Writing such generalized workflow configurations will become even more challenging, yes. But finding good/general workflows is already hard and requires expert knowledge and experimental data. We need not look at the divide between field experts and users as a drawback: it could also be called division of labour! The more work goes into a workflow configuration, the more versatile and usable it becomes, too. If users had to choose between 100 specialised yet simple workflows (and perhaps write the 101st themselves), or just 2–3 general but complex ("intelligent") workflows, then I think they would usually prefer the latter.
> The first step, IMHO, is to extend the core CLI in such a way that each processor reports its outcome in a unified way, so that core can decide what to do next, i.e.
> - skip the current workflow entity, which is physically represented by a whole page (binarization), a page region (segmentation) or even just a single line - just by not integrating it into the current filegroup
> - if no input is left for the following processor, log this and exit gracefully
> - try a different processor (configured or based on different heuristics)
I gave it some thought, and I do think this "skip strategy" is a valid choice – if done right.
Skipping a page must not mean creating no page in the output file group, though – it should at least mean passing on a copy of the input annotation. (That is also the natural generalisation of empty segments within a page.) Thus, the workflow can still re-run the same step with other processors/parameters (as long as they are capable of incremental annotation), but it could also just follow up with the next step. The only thing forcing a re-run, or necessitating cancelling the workflow entirely, would now be too many empty pages or quality that is simply too bad on average.
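For illustration, a crude document-level check in that spirit – assuming the usual on-disk layout of a fileGrp, taking the absence of any TextRegion as a proxy for a passed-through page, and with an arbitrary 20% limit and fallback processor:

```sh
# Count output pages of a segmentation step that contain no region at all,
# i.e. were presumably just passed through, and only then trigger a re-run.
total=0; empty=0
for page in OCR-D-SEG-TESS/*.xml; do
    total=$((total + 1))
    grep -q '<TextRegion' "$page" || empty=$((empty + 1))
done
if [ $((100 * empty)) -gt $((20 * total)) ]; then   # more than 20% empty pages
    echo "too many empty pages ($empty of $total), re-running segmentation" >&2
    ocrd-cis-ocropy-segment -I OCR-D-SEG-TESS -O OCR-D-SEG-OCRO   # incremental re-annotation
fi
```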
And I do agree that core has partial responsibility for this behaviour, because it could call processors page by page, catching their exceptions and falling back to pass-through. See ~~https://github.com/OCR-D/core/issues/322#issuecomment-592775407~~ OCR-D/core#579 for my proposal on that.
Beyond that, however, we are not talking about core's Python API here, but about the OCR-D workflow engine (whatever that will be). AFAICS, workflow-configuration cannot ever be generalized to add re-processing (alternative paths), because make has no way to express conditional dependencies. Of course, core's ocrd process could be extended with support for conditional workflow syntax (see #171) and parallelisation (perhaps even load balancing and network distribution). But (beyond the CLI) we should allow other implementations with the same interface/functionality, too. Meaning, none of this looks like a "first step". We will likely build this up from shell scripts in the beginning.
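For contrast, this is roughly what ocrd process can express today: a purely linear task chain (parameters omitted here), which is why any conditional or fallback logic currently has to live in a shell wrapper around calls like this:

```sh
# No branches, no retries – just a fixed sequence of processors on one workspace:
ocrd process \
    "olena-binarize -I OCR-D-IMG -O OCR-D-BIN" \
    "tesserocr-deskew -I OCR-D-BIN -O OCR-D-DESK" \
    "tesserocr-segment-region -I OCR-D-DESK -O OCR-D-SEG"
```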