
Validation of suspiciously small text regions

Open tboenig opened this issue 6 years ago • 5 comments

Validation of PAGE should make sure that there are no regions, lines etc. that collapse to a line or a point, or that are otherwise very small.

Any suggestions on realistic dimensions below which to raise a warning? Say, less than 10 pixels wide or high?

OCR-D/assets#28

tboenig avatar Jul 11 '19 14:07 tboenig

Yes, again, a case for OCR-D/core#252

bertsky avatar Jul 11 '19 16:07 bertsky

Wait, what are "suspiciously small" regions? Will this not get hairy fast with heuristics based on dimensions? What about e.g. thin separator lines or punctuation marks?

cneud avatar Jul 31 '19 22:07 cneud

I think what @tboenig meant was suspiciously small text regions (and lines).

And yes, that would have to depend on the DPI of the input, too.

And yes, it could still get hairy with single-region punctuation marks or page numbers like "I" – but too many warnings in the validator are still better than searching the complete haystack by hand, right? Perhaps geometry heuristics should differentiate between forbidden and suspicious?
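To make the forbidden/suspicious distinction concrete, here is a minimal sketch, not actual OCR-D code. It assumes the region outline is given as a PAGE `Coords/@points` string (`x1,y1 x2,y2 ...`), takes the 10 pixel figure from above as a threshold at an assumed reference resolution of 300 DPI, and scales it linearly with the image DPI. The function names, the labels and the reference resolution are all hypothetical.

```python
from typing import List, Tuple


def parse_points(points: str) -> List[Tuple[int, int]]:
    """Parse a PAGE-style points string like '0,0 100,0 100,50' into (x, y) pairs."""
    pairs = []
    for pair in points.split():
        x, y = pair.split(",")
        pairs.append((int(x), int(y)))
    return pairs


def classify_region(points: str, dpi: int = 300,
                    min_px_at_300dpi: float = 10.0) -> str:
    """Return 'forbidden', 'suspicious' or 'ok' for a text region outline.

    Assumption: 'forbidden' means degenerate geometry (no polygon area
    possible), 'suspicious' means the bounding box is smaller than a
    DPI-scaled threshold in either dimension.
    """
    coords = parse_points(points)
    if len(coords) < 3:
        return "forbidden"  # a point or a line, not a polygon
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    if width == 0 or height == 0:
        return "forbidden"  # all points are collinear along one axis
    # Scale the pixel threshold with the resolution (hypothetical rule).
    threshold = min_px_at_300dpi * dpi / 300
    if width < threshold or height < threshold:
        return "suspicious"
    return "ok"


if __name__ == "__main__":
    print(classify_region("0,0 100,0 100,50 0,50"))        # regular region
    print(classify_region("0,0 100,0"))                    # only two points
    print(classify_region("0,0 8,0 8,40 0,40"))            # 8 px wide
    print(classify_region("0,0 15,0 15,40 0,40", dpi=600)) # 15 px at 600 DPI
```

A narrow page number like "I" would still come out as suspicious here, which is the intended behaviour: it is flagged for review, not rejected.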

bertsky avatar Jul 31 '19 22:07 bertsky

@bertsky Thanks, I've updated the title accordingly. Anyway, for all "validations" that are not directly related to violations of the PAGE schema, I would expect a warning or suspicious flag rather than an error or forbidden flag.

cneud avatar Aug 01 '19 11:08 cneud

@cneud Absolutely! This is not about the XML syntax, but about our (application-specific) semantic constraints. So maybe we should call this whole thing evaluation instead of validation, and have the report give a score instead of a boolean? (We could even offer different metrics for different situations, as in PRImA's layout evaluation profiles...)
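A score-based report could look roughly like the following sketch. It assumes every region has already been labelled `ok`, `suspicious` or `forbidden` by some geometry check. The penalty weights and the function name are made up for illustration and are not taken from PRImA or OCR-D.

```python
from typing import Dict, Iterable, Optional

# Hypothetical penalties per label; a different "profile" could pass its own.
DEFAULT_PENALTIES: Dict[str, float] = {
    "ok": 0.0,
    "suspicious": 0.5,
    "forbidden": 1.0,
}


def evaluation_score(labels: Iterable[str],
                     penalties: Optional[Dict[str, float]] = None) -> float:
    """Return a score in [0, 1], where 1.0 means no region was flagged."""
    weights = DEFAULT_PENALTIES if penalties is None else penalties
    labels = list(labels)
    if not labels:
        return 1.0  # nothing to penalise on an empty page
    total_penalty = sum(weights[label] for label in labels)
    return 1.0 - total_penalty / len(labels)


if __name__ == "__main__":
    page = ["ok", "ok", "suspicious", "forbidden"]
    print(evaluation_score(page))
    # A stricter profile that treats suspicious regions like forbidden ones:
    strict = {"ok": 0.0, "suspicious": 1.0, "forbidden": 1.0}
    print(evaluation_score(page, strict))
```

Passing a different penalty table is one simple way to get the "different metrics for different situations" mentioned above without changing the checks themselves.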

bertsky avatar Aug 02 '19 07:08 bertsky