Validation of suspiciously small text regions
Validation of PAGE should make sure that no region or text line collapses to a line or a point, or is otherwise very small.
Any suggestions on realistic dimensions to raise a warning? Perhaps less than 10 pixels wide or high?
OCR-D/assets#28
Yes, again, a case for OCR-D/core#252
Wait, what are "suspiciously small" regions? Will this not get hairy fast with heuristics based on dimensions? What about e.g. thin separator lines or punctuation marks?
I think what @tboenig meant was suspiciously small text regions (and lines).
And yes, that would have to depend on the DPI of the input, too.
And yes, it could still get hairy with single-region punctuation marks or page numbers like "I" – but too many warnings in the validator are still better than searching the complete haystack by hand, right? Perhaps geometry heuristics should differentiate between forbidden and suspicious?
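
To make the forbidden/suspicious distinction concrete, here is a rough sketch, not a proposal for the actual validator. It assumes coordinates come as a PAGE `Coords/@points` string, takes the 10-pixel threshold suggested above as referring to 300 DPI, and scales it linearly with resolution. The names `Severity`, `classify_region` and `min_size_at_300dpi` are made up for illustration.

```python
from enum import Enum


class Severity(Enum):
    OK = "ok"
    SUSPICIOUS = "suspicious"  # small, but could be legitimate (e.g. "I", ".")
    FORBIDDEN = "forbidden"    # degenerate geometry: point, line, or no area


def parse_points(points):
    """Parse a PAGE-style points string such as '10,20 30,20 30,40'."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def classify_region(points, dpi=300, min_size_at_300dpi=10):
    """Classify a polygon by its bounding box (hypothetical heuristic)."""
    coords = parse_points(points)
    # Fewer than three distinct points cannot enclose any area.
    if len(set(coords)) < 3:
        return Severity.FORBIDDEN
    xs, ys = zip(*coords)
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    # A zero-width or zero-height bounding box is a line, not a region.
    if width == 0 or height == 0:
        return Severity.FORBIDDEN
    # Assumption: the pixel threshold scales linearly with the DPI.
    threshold = min_size_at_300dpi * dpi / 300
    if width < threshold or height < threshold:
        return Severity.SUSPICIOUS
    return Severity.OK
```

With these assumptions, `classify_region("0,0 15,0 15,50 0,50")` passes at 300 DPI, but is flagged as suspicious with `dpi=600`, because the threshold doubles to 20 pixels.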
@bertsky Thanks, I've updated the title accordingly. Anyway, for all "validations" that are not directly related to violations of the PAGE schema, I would expect a warning or suspicious flag rather than error or forbidden.
@cneud Absolutely! This is not about the XML syntax, but about our (application-specific) semantic constraints. So maybe we should call this whole thing evaluation instead of validation, and have the report give a score instead of a boolean? (We could even offer different metrics for different situations, as in PRImA's layout evaluation profiles...)
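
Just to illustrate what a score instead of a boolean could look like, here is a minimal sketch. The per-finding labels and the penalty weights are purely hypothetical and would presumably differ between evaluation profiles; `evaluation_score` is an invented name, not an existing function in core.

```python
# Hypothetical penalties per finding; a real profile would tune these.
DEFAULT_PENALTIES = {"ok": 0.0, "suspicious": 0.5, "forbidden": 1.0}


def evaluation_score(findings, penalties=None):
    """Return a score in [0, 1]; 1.0 means no region was flagged.

    `findings` holds one label per region: "ok", "suspicious" or "forbidden".
    """
    penalties = DEFAULT_PENALTIES if penalties is None else penalties
    if not findings:
        # Nothing to evaluate, so nothing to penalize.
        return 1.0
    total = sum(penalties[label] for label in findings)
    return 1.0 - total / len(findings)


if __name__ == "__main__":
    print(evaluation_score(["ok", "ok", "suspicious", "forbidden"]))
```

A stricter profile could pass its own `penalties` mapping, for example one that counts every suspicious region as a full failure.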