spec More specifics on AlternativeImage processing

The spec should be more specific about how AlternativeImage must be used. There are issues of coordinate reproducibility and disambiguation, and we need another @comments class, rescaled. See here for a full description. (I don't want to transfer the issue here.)

Jun 26 '19 13:06 bertsky

The above-mentioned issue in ocrd_tesserocr has now been closed (because a first proof-of-concept implementation was merged there), but the discussion of the open problems, and of adding detail to the spec, should continue.

@wrznr Do you think I should copy my argumentation here, or can this be continued in the closed issue?

Jul 04 '19 11:07 bertsky

@bertsky Neither. We should continue the discussion here as soon as @cneud and @kba are available. But there is no need for copying.

Jul 04 '19 12:07 wrznr

Meanwhile, another related issue came up: Now that we have the possibility of implicit output file groups in METS-XML via the derived images referenced in PAGE-XML, no workspace engine will be able to know what side effects a processing step can have. For example, if one processor writes to OCR-D-IMG-BIN, and another does too, there can be conflicts – especially with parallel execution, but even with sequential execution.

Thus, these file groups must be made explicit so they can be validated (in the usual way).
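
To illustrate the kind of check that explicit file groups would allow, here is a minimal sketch. It assumes each workflow step declares its output file groups up front; the step names, the extra file group names and the `find_output_conflicts` helper are hypothetical and not part of any OCR-D API.

```python
from collections import defaultdict


def find_output_conflicts(steps):
    """Map each file group written by more than one step to those steps.

    `steps` is a list of (step_name, output_file_groups) pairs; this shape
    is an assumption made for illustration only.
    """
    writers = defaultdict(list)
    for step_name, file_groups in steps:
        for file_group in file_groups:
            writers[file_group].append(step_name)
    return {grp: names for grp, names in writers.items() if len(names) > 1}


# Two hypothetical steps that both write derived images to OCR-D-IMG-BIN.
steps = [
    ("binarize-page", ["OCR-D-BIN", "OCR-D-IMG-BIN"]),
    ("binarize-region", ["OCR-D-BIN2", "OCR-D-IMG-BIN"]),
]
conflicts = find_output_conflicts(steps)
```

With implicit file groups, the `OCR-D-IMG-BIN` entries would not appear in the declarations at all, so a check like this could not see the collision.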

Aug 28 '19 19:08 bertsky

> Meanwhile, another related issue came up: Now that we have the possibility of implicit output file groups in METS-XML via the derived images referenced in PAGE-XML, no workspace engine will be able to know what side effects a processing step can have. For example, if one processor writes to OCR-D-IMG-BIN, and another does too, there can be conflicts – especially with parallel execution, but even with sequential execution.

This side topic has since resurfaced (again as a side topic) in another issue, where it was decided to indeed change the specification so that all new derived images are placed in the output fileGrp (alongside the PAGE-XML).

However, the above original issue remains pressing: we have to reflect the problem of AlternativeImage coordinate consistency and its solution in core within the spec.

This would entail:

- stating the general problem briefly (to clearly motivate these strict and elaborate requirements), but detailing aspects like reshaping during rotation, center of rotation, multi-level rotation, splitting @orientation into reflection vs rotation
- upgrading the AlternativeImage/@comments classes ("image features") to mandatory
- extending them appropriately to all needed features
- explaining their interpretation in detail (including the difference between level-local features like deskewed and inherited features like binarized)
- adding the principle of appending to AlternativeImage (not replacing/inserting), and appending its @comments (not starting empty but keeping all features of the image data it was derived from and then adding the new features)
- addressing/discussing the open problems of down/up-scaling and dewarping
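
The appending principle could be sketched roughly as follows. This is only an illustration under stated assumptions: plain dicts stand in for AlternativeImage elements, @comments is assumed to be a comma-separated feature list, and both helper names are hypothetical.

```python
def derive_comments(parent_comments, new_features):
    """Return @comments for a derived image: parent features plus new ones."""
    # Assumption: @comments is a comma-separated list of feature names.
    features = [f for f in parent_comments.split(",") if f]
    for feature in new_features:
        if feature not in features:
            features.append(feature)
    return ",".join(features)


def append_alternative(alternatives, filename, new_features):
    """Append a derived image; never replace or insert before existing ones."""
    # The new image is derived from the most recently appended one, if any.
    parent = alternatives[-1]["comments"] if alternatives else ""
    alternatives.append(
        {"filename": filename, "comments": derive_comments(parent, new_features)}
    )


alternatives = []
append_alternative(alternatives, "page_cropped.png", ["cropped"])
append_alternative(alternatives, "page_cropped_bin.png", ["binarized"])
```

After the second call, the last entry carries both features, so a consumer can tell which operations the image data has already undergone.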

Jun 23 '20 22:06 bertsky

Yet another open problem has surfaced: When a processor changes coordinates of some existing segment, it must also remove all existing derived images for that segment, because they will be invalid. This includes the following cases:

- whenever overwriting a Page's Border, or a Border's Coords or Coords/@points, remove all the Page's derived images with cropped,
- whenever overwriting a Region's, TextLine's, Word's or Glyph's Coords or Coords/@points, remove all its derived images,
- whenever overwriting a Page's or Region's @orientation, remove all its derived images with deskewed.
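
As a rough sketch of these three cases (not an existing implementation): dicts again stand in for AlternativeImage elements, @comments is assumed to be comma-separated, and the function name and the `change` labels are hypothetical.

```python
def drop_invalid_images(alternatives, change):
    """Return the derived images that stay valid after `change`.

    `change` is one of "border", "coords" or "orientation", hypothetical
    labels for the three cases listed above.
    """
    if change == "coords":
        # New segment coordinates invalidate every derived image.
        return []
    stale_feature = {"border": "cropped", "orientation": "deskewed"}[change]
    return [
        alt for alt in alternatives
        if stale_feature not in alt["comments"].split(",")
    ]


page_images = [
    {"filename": "bin.png", "comments": "binarized"},
    {"filename": "crop.png", "comments": "binarized,cropped"},
    {"filename": "deskew.png", "comments": "binarized,cropped,deskewed"},
]
after_recrop = drop_invalid_images(page_images, "border")
```

In this example, re-cropping keeps only the image that was never cropped; everything derived from the old crop is dropped.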

But this may also remove the only result of some previous workflow step (like binarization or denoising). So workflows would need to re-do those steps afterwards, and workflow writers must be aware of this. To that end, consuming processors should ideally fail immediately when they cannot find the derived images they expected. But that only happens when the implementors choose to use image feature selection/filtering (which is still optional).
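
A fail-fast consumer might look roughly like this. The sketch assumes comma-separated @comments and uses dicts in place of AlternativeImage elements; `select_image`, its parameters and `MissingDerivedImage` are hypothetical names, not the actual API in core.

```python
class MissingDerivedImage(LookupError):
    """Raised when no derived image has the requested features."""


def select_image(alternatives, required=(), forbidden=()):
    """Return the most recently appended image matching the feature sets."""
    required, forbidden = set(required), set(forbidden)
    for alt in reversed(alternatives):
        features = {f for f in alt["comments"].split(",") if f}
        if required <= features and not (features & forbidden):
            return alt
    # Fail immediately instead of silently falling back to another image.
    raise MissingDerivedImage(
        "no derived image with features %s" % sorted(required)
    )


images = [{"filename": "bin.png", "comments": "binarized"}]
binarized = select_image(images, required=["binarized"])
```

Calling `select_image(images, required=["binarized", "denoised"])` on the same list raises `MissingDerivedImage`, which is the behaviour a workflow writer would want to see right after a re-cropping step has dropped the denoised images.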

For modules like ocrd_tesserocr it's not easy to decide which image features should be present. Some workflows might want to include a binarization step, while others might want to use Tesseract's internal binarization. Also, at the processor level, it cannot be decided whether denoised is a required feature. (But still, if the workflow included a denoising step earlier it would be quite surprising if no denoised images are available after re-cropping, for example.)

Oct 31 '20 00:10 bertsky
