ocrd_segment

plausibilize and sanitize are too broad terms

Open mikegerber opened this issue 6 years ago • 5 comments

ocrd-segment-repair has the optional operations "plausibilize" and "sanitize" – I have no idea what exactly these do :) I would prefer something like this:

  • shrink-regions-to-hull-of-lines
  • whatever-plausibilize-does

There seems to also be another thing ocrd-segment-repair does.

In other words: Make operations explicit.

mikegerber avatar Nov 26 '19 12:11 mikegerber

ocrd-segment-repair has the optional operations "plausibilize" and "sanitize" – I have no idea what exactly these do :)

I agree, these are not expressive enough, or even memorable (which is what...)

I would prefer something like this:

* shrink-regions-to-hull-of-lines

...or just shrink-regions?

* whatever-plausibilize-does

ATM all it does is remove regions fully contained by others or nearly equal to them (and fix the ReadingOrder afterwards).
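To make that behaviour concrete, here is a minimal sketch of the idea, not the actual ocrd-segment-repair code. It assumes axis-aligned bounding boxes instead of real polygons, and the names `drop_redundant`, `Box` and the `threshold` value are hypothetical, chosen only for illustration. A region is dropped when a region at least as large covers (nearly) all of its area, which handles both "fully contained" and "nearly equal"; the reading order is then filtered to the surviving region IDs.

```python
from typing import Dict, List, Tuple

# Hypothetical simplification: a region is an axis-aligned box (x0, y0, x1, y1).
Box = Tuple[float, float, float, float]


def area(b: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as empty."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def intersection(a: Box, b: Box) -> float:
    """Area shared by two boxes (0.0 if they do not overlap)."""
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))


def drop_redundant(regions: Dict[str, Box], order: List[str],
                   threshold: float = 0.9) -> Tuple[Dict[str, Box], List[str]]:
    """Remove regions (nearly) covered by a region at least as large,
    then repair the reading order. `threshold` is an assumed value."""
    removed = set()
    ids = sorted(regions, key=lambda r: area(regions[r]))  # smallest first
    for i, small in enumerate(ids):
        small_area = area(regions[small])
        if small_area == 0:
            continue  # empty regions are left alone in this sketch
        for big in ids[i + 1:]:
            if big in removed:
                continue
            covered = intersection(regions[small], regions[big]) / small_area
            if covered >= threshold:
                removed.add(small)
                break
    kept = {k: v for k, v in regions.items() if k not in removed}
    return kept, [r for r in order if r in kept]


regions = {
    "r1": (0, 0, 100, 100),
    "r2": (10, 10, 50, 50),    # fully contained in r1
    "r3": (0, 0, 99, 100),     # nearly equal to r1
    "r4": (200, 0, 300, 100),  # unrelated region
}
kept, new_order = drop_redundant(regions, ["r1", "r2", "r3", "r4"])
print(sorted(kept), new_order)
```

With real page segmentation the same test would be done on polygons (for instance with a geometry library), but the decision rule stays the same.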

It's intended to become much more though, like merging or shrinking overlapping neighbouring regions, or fixing reading order via basic heuristics (e.g. no arbitrary jumps back and forth).

Since this processor started out under the name repair but received a default behaviour of just warning about likely errors, we needed some verb for the actual action.

Maybe separate-neighbours?

@wrznr?

bertsky avatar Nov 26 '19 13:11 bertsky

Right, they have very generic names since they are intended to do various things. Right now, they do not do very much and are not ready for production use or even testing. I would rather keep the current names and see what the processors will become. Let us discuss a proper name when implementation and documentation are finished. (ocrd_segment will be my main focus in December)

wrznr avatar Nov 26 '19 13:11 wrznr

Related: qurator-spk/ocrd_repair_inconsistencies#2

mikegerber avatar Nov 26 '19 16:11 mikegerber

Documentation from https://ocr-d.de/en/workflows:

  • plausibilize = Remove redundant (almost equal or almost contained) regions, and merge overlapping regions
  • sanitize = Shrink and/or expand a region in such a way that it coordinates include those of all its lines
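As a rough illustration of the sanitize description, the sketch below recomputes a region outline from its lines. It is an assumption-laden simplification, not the processor's implementation: boxes stand in for polygons, and `hull_of_lines` is a hypothetical helper name.

```python
from typing import List, Tuple

# Hypothetical simplification: lines and regions are boxes (x0, y0, x1, y1).
Box = Tuple[float, float, float, float]


def hull_of_lines(lines: List[Box]) -> Box:
    """Smallest box enclosing all line boxes. Used as the new region
    outline, this shrinks or expands the region as needed."""
    if not lines:
        raise ValueError("region has no lines to derive an outline from")
    return (min(l[0] for l in lines), min(l[1] for l in lines),
            max(l[2] for l in lines), max(l[3] for l in lines))


# The old region is too wide on the left and too short at the bottom.
old_region = (0, 0, 90, 30)
lines = [(10, 10, 90, 20), (12, 25, 95, 35)]
new_region = hull_of_lines(lines)
print(old_region, "->", new_region)
```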

mikegerber avatar Oct 16 '20 16:10 mikegerber

Documentation from https://ocr-d.de/en/workflows:

  • plausibilize = Remove redundant (almost equal or almost contained) regions, and merge overlapping regions
  • sanitize = Shrink and/or expand a region in such a way that it coordinates include those of all its lines

This is actually from the ocrd-tool json description of these parameters, see ocrd-segment-repair -h

bertsky avatar Oct 16 '20 17:10 bertsky