RFC: A standard --validate option
Another idea that came up in https://github.com/OCR-D/ocrd_olena/issues/60: I routinely run validation after each processor to catch problems early. If there were a standard --validate option in core (supplemented by a config file that sets e.g. the --skip options), this pattern:
ocrd-olena-binarize --overwrite -I $INPUT_FILE_GRP -O OCR-D-IMG-BINPAGE,OCR-D-IMG-BIN -P impl sauvola-ms-split
ocrd workspace validate $validate_options
ocrd-sbb-textline-detector --overwrite -I OCR-D-IMG-BINPAGE -O OCR-D-SEG-LINE -P model /var/lib/textline_detection
ocrd workspace validate $validate_options
would simplify to:
ocrd-olena-binarize --validate --overwrite -I $INPUT_FILE_GRP -O OCR-D-IMG-BINPAGE,OCR-D-IMG-BIN -P impl sauvola-ms-split
ocrd-sbb-textline-detector --validate --overwrite -I OCR-D-IMG-BINPAGE -O OCR-D-SEG-LINE -P model /var/lib/textline_detection
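Until something like this exists in core, the same effect can be approximated with a small shell wrapper. This is only a sketch: `run_validated` is a hypothetical helper name, not anything provided by OCR-D, and it assumes `validate_options` is set as in the commands above and that the current directory is the workspace.

```shell
# Hypothetical helper, not part of OCR-D core: run a processor, then validate.
# validate_options is the same variable used in the commands above.
run_validated() {
  # Run the processor command exactly as given; on failure, skip validation
  # and pass its exit status on.
  "$@" || return $?
  # Left unquoted on purpose so the options split into separate words.
  ocrd workspace validate $validate_options
}

# Usage (assumes ocrd and the processor are installed):
# run_validated ocrd-olena-binarize --overwrite -I "$INPUT_FILE_GRP" \
#   -O OCR-D-IMG-BINPAGE,OCR-D-IMG-BIN -P impl sauvola-ms-split
```

This keeps the validation call in one place, but every workflow script still has to define the wrapper itself, which is what a built-in option would avoid.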
I think some kind of configuration for this hypothetical option is absolutely required. For example, I use these options to make routine validation useful for me:
--skip dimension
--skip pixel_density
--page-strictness lax
--page-coordinate-consistency off
See also #557 for ideas regarding configuration.
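To make the configuration idea a bit more concrete, here is a rough sketch of how such options could be read from a file today. Everything about the file is an assumption for illustration: the format (one option per line, blank lines and lines starting with `#` ignored), the helper name `load_validate_options`, and the example file name are hypothetical and not an existing OCR-D feature.

```shell
# Hypothetical helper: collect validation options from a plain-text file.
# Assumed format: one option per line, e.g. "--skip dimension";
# blank lines and lines starting with '#' are ignored.
load_validate_options() {
  opts=""
  # The "|| [ -n ... ]" part also picks up a last line without a newline.
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      ''|'#'*) continue ;;
    esac
    # Join the options with single spaces.
    opts="${opts:+$opts }$line"
  done < "$1"
  printf '%s\n' "$opts"
}

# Usage (the file name is made up for this example):
# validate_options=$(load_validate_options ocrd-validate.conf)
# ocrd workspace validate $validate_options
```

A real implementation in core would more likely reuse whatever configuration mechanism comes out of #557; this only shows that the four options above fit a very simple format.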
That's exactly what I proposed here a while ago:
Perhaps we should start adding other mechanisms that affect all processors equally (like the loglevel override): ... Or supporting automatic workspace validation with different levels/sets of checks.