
add image preprocessing steps

Open bertsky opened this issue 5 years ago • 5 comments

IMO there is a large, still unmet demand in OCR-D for image preprocessing tools to

  1. color-normalize raw images (e.g. linear or non-linear contrast stretching, gamma correction)
  2. denoise raw images (i.e. luminance/grayscale or color denoising before binarization)

Most binarization algorithms depend on this. For example, Sauvola (unless it exposes the R parameter and one can estimate a good fit from the image dynamics) assumes full dynamic range.
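To illustrate the point about dynamic range: Sauvola's local threshold is usually written as `T = m * (1 + k * (s / R - 1))`, where `m` and `s` are the local mean and standard deviation and `R` is the assumed dynamic range of the standard deviation (commonly 128 for 8-bit images). If the raw image only occupies a narrow band of gray values, `s / R` is systematically too small. The following is a minimal sketch of the two normalization steps mentioned above, using NumPy only. The function names and the percentile defaults are illustrative assumptions, not part of any OCR-D processor.

```python
import numpy as np

def stretch_contrast(gray, low_pct=1.0, high_pct=99.0):
    """Linearly map the [low_pct, high_pct] percentile range to [0, 255]."""
    gray = gray.astype(np.float64)
    lo, hi = np.percentile(gray, [low_pct, high_pct])
    if hi <= lo:  # flat image: nothing to stretch
        return np.clip(gray, 0, 255).astype(np.uint8)
    stretched = (gray - lo) / (hi - lo)
    return (np.clip(stretched, 0.0, 1.0) * 255).round().astype(np.uint8)

def gamma_correct(gray, gamma=1.0):
    """Non-linear correction: out = 255 * (in / 255) ** gamma."""
    normalized = gray.astype(np.float64) / 255.0
    return (normalized ** gamma * 255).round().astype(np.uint8)

# Synthetic low-contrast "scan": gray values confined to [90, 160]
rng = np.random.default_rng(0)
raw = rng.integers(90, 161, size=(64, 64)).astype(np.uint8)
normalized = stretch_contrast(raw)  # now spans the full 8-bit range
```

Clipping at percentiles rather than at the absolute minimum and maximum is a design choice that keeps a few outlier pixels from dominating the stretch.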

So how about adding the following:

  • in the METS specs, new `fileGrp/@USE` name recommendations `OCR-D-IMG-NORM` and `OCR-D-IMG-RAWDEN`
  • in the PAGE specs, new `AlternativeImage/@comments` classes `normalized` and `raw-denoised`
  • in the ocrd-tool schema, new `tool/steps` enum types `preprocessing/optimization/normalization` (which is different from `grayscale_normalization`) and `preprocessing/optimization/raw-denoising` (which is different from binary despeckling)

bertsky avatar Jun 10 '20 07:06 bertsky

> * in the METS specs, new `fileGrp/@USE` name recommendations `OCR-D-IMG-NORM` and `OCR-D-IMG-RAWDEN`

should now read: OCR-D-PRE-NORM and OCR-D-PRE-RAWDEN

> * in the PAGE specs, new `AlternativeImage/@comments` classes `normalized` and `raw-denoised`

Instead of introducing the term raw denoising, we could also differentiate despeckling (after binarization) and denoising (before binarization)...

bertsky avatar Jul 15 '20 11:07 bertsky

> should now read: OCR-D-PRE-NORM and OCR-D-PRE-RAWDEN

:+1:

> Instead of introducing the term raw denoising, we could also differentiate despeckling (after binarization) and denoising (before binarization)...

IMHO "raw denoising" is clearer than distinguishing despeckling/denoising. Then again, our glossary currently defines despeckling as

> Remove artifacts such as smudges, ink blots, underlinings etc. from an image. Typically applied to remove “salt-and-pepper” noise resulting from Binarization.

And "denoise" is not introduced at all. So, we're free to define it as you proposed. @EEngl52 any objection?

kba avatar Jul 15 '20 13:07 kba

> Then again, our glossary currently defines despeckling as
>
> > Remove artifacts such as smudges, ink blots, underlinings etc. from an image. Typically applied to remove “salt-and-pepper” noise resulting from Binarization.

Oh, but these physical artifacts cannot be reliably removed after binarization IMHO. You need special detectors on raw colors. So if that's the term OCR-D (or the OCR community in general) has agreed upon, let's stick to that, and not project any other interpretation. In that sense I think we still have no despeckling processors yet.

> And "denoise" is not introduced at all.

Then let's define it! Let's also differentiate between raw and bilevel denoising.
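As a rough illustration of the distinction, here is a minimal sketch of one possible operation of each kind: a 3x3 median filter on grayscale data (raw denoising) and removal of isolated foreground pixels from a binary mask (bilevel denoising). Both functions are hypothetical examples written with NumPy only; they are not the algorithms used by any existing OCR-D processor, and the mask convention (`True` = foreground) is an assumption.

```python
import numpy as np

def median_denoise_raw(gray):
    """3x3 median filter on a grayscale image (raw denoising)."""
    padded = np.pad(gray, 1, mode="edge")
    h, w = gray.shape
    # Stack the 9 shifted views of each neighbourhood, then take the median.
    windows = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(3) for dx in range(3)])
    return np.median(windows, axis=0).astype(gray.dtype)

def remove_isolated_pixels(binary):
    """Drop foreground pixels without any 8-connected foreground neighbour
    (a very simple form of bilevel denoising)."""
    fg = binary.astype(bool)
    padded = np.pad(fg, 1, mode="constant", constant_values=False)
    h, w = fg.shape
    # Count foreground neighbours, skipping the centre pixel itself.
    neighbours = sum(padded[dy:dy + h, dx:dx + w].astype(int)
                     for dy in range(3) for dx in range(3)
                     if (dy, dx) != (1, 1))
    return fg & (neighbours > 0)

# Raw case: a single dark outlier on a light background is smoothed away.
gray = np.full((5, 5), 200, dtype=np.uint8)
gray[2, 2] = 0
clean_gray = median_denoise_raw(gray)

# Bilevel case: the lone pixel is removed, the two-pixel stroke is kept.
mask = np.zeros((5, 5), dtype=bool)
mask[1, 1] = True
mask[3, 2:4] = True
clean_mask = remove_isolated_pixels(mask)
```

The raw variant works on intensity values and can use the full gray-level information, whereas the bilevel variant only sees connectivity, which is why the two are worth naming separately.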

bertsky avatar Jul 15 '20 13:07 bertsky

IMO we could differentiate denoising/despeckling. But then the processors should be named accordingly. I would find it quite confusing to use a processor called denoising in a workflow step called despeckling. So it would probably be easier to go with @bertsky's last suggestion on raw and bilevel denoising and to actually define denoising in the glossary.

EEngl52 avatar Jul 15 '20 13:07 EEngl52

> But then the processors should be named accordingly. I would find it quite confusing to use a processor called denoising in a workflow step called despeckling.

Absolutely. Since despeckling was all we had, the current denoising processors all use that (in @comments and tool json):

  • ocrd_cis: ocrd-cis-ocropy-denoise, ocrd-cis-ocropy-binarize
  • ocrd_wrap: ocrd-skimage-denoise, ocrd-skimage-denoise-raw

We should open respective issues in those repos, and in the workflow guide of course.

> And "denoise" is not introduced at all.

> Then let's define it! Let's also differentiate between raw and bilevel denoising.

So how about:

  • in the PAGE specs, new `AlternativeImage/@comments` classes `normalized` and `denoised`
    (IMO there's no need for a `raw-denoised`, since we now require ordering anyway, so we should see things like `denoised,binarized,denoised`)
  • in the ocrd-tool schema, new `tool/steps` enum types `preprocessing/optimization/normalization` (which is different from `grayscale_normalization`), `preprocessing/optimization/raw-denoising` (which is different from despeckling) and `preprocessing/optimization/binary-denoising`
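The ordering argument can be sketched as follows: if `@comments` is an ordered, comma-separated list of features, a consumer can tell raw from binary denoising just by checking whether `binarized` precedes a given `denoised` entry. The helper names below are hypothetical and only illustrate the idea; they are not part of the OCR-D core API.

```python
def add_feature(comments, feature):
    """Append one processing step to a comma-separated feature string."""
    features = [f for f in comments.split(",") if f]
    features.append(feature)
    return ",".join(features)

def denoising_kinds(comments):
    """Label each 'denoised' entry as 'raw' or 'binary', depending on
    whether a 'binarized' entry appears before it."""
    kinds, seen_binarized = [], False
    for feature in comments.split(","):
        if feature == "binarized":
            seen_binarized = True
        elif feature == "denoised":
            kinds.append("binary" if seen_binarized else "raw")
    return kinds

# Simulate three processors annotating the same derived image in sequence.
comments = ""
for step in ("denoised", "binarized", "denoised"):
    comments = add_feature(comments, step)

kinds = denoising_kinds(comments)
```

Under this scheme a single `denoised` class suffices in PAGE, while the ocrd-tool `steps` enum can still keep the two workflow steps apart.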

bertsky avatar Jul 15 '20 13:07 bertsky