More specifics on AlternativeImage processing
The spec should be more specific about how AlternativeImage must be used. There are issues of coordinate reproducibility and disambiguation, and we need another @comments class, rescaled. See here for a full description. (I don't want to transfer the issue here.)
The above-mentioned issue in ocrd_tesserocr has been closed now (because a first proof-of-concept implementation has been merged there), but the discussion of the open problems, and of adding detail to the spec, should be continued.
@wrznr Do you think I should copy my argumentation here, or can this be continued in the closed issue?
@bertsky Neither. We should continue the discussion here as soon as @cneud and @kba are available. But there is no need for copying.
Meanwhile, another related issue came up: Now that we have the possibility of implicit output file groups in METS-XML via the derived images referenced in PAGE-XML, no workspace engine will be able to know what side effects a processing step can have. For example, if one processor writes to OCR-D-IMG-BIN, and another does too, there can be conflicts – especially with parallel execution, but even with sequential execution.
Thus, these file groups must be made explicit so they can be validated (in the usual way).
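To illustrate what such a validation could look like once the file groups are declared explicitly, here is a minimal sketch. The step dictionaries, the processor names and the helper `find_output_conflicts` are hypothetical and not part of any OCR-D API; the only point is that explicit declarations make write conflicts detectable before anything runs.

```python
from collections import defaultdict

def find_output_conflicts(steps):
    """Return {fileGrp: [processor names]} for every output fileGrp
    that is declared by more than one step (hypothetical helper)."""
    writers = defaultdict(list)
    for step in steps:
        for grp in step["output_file_grps"]:
            writers[grp].append(step["processor"])
    return {grp: names for grp, names in writers.items() if len(names) > 1}

# Hypothetical workflow description; processor names are made up.
steps = [
    {"processor": "binarize-a", "output_file_grps": ["OCR-D-BIN", "OCR-D-IMG-BIN"]},
    {"processor": "binarize-b", "output_file_grps": ["OCR-D-BIN2", "OCR-D-IMG-BIN"]},
]
conflicts = find_output_conflicts(steps)
```

With implicit file groups, the second entry of each `output_file_grps` list would be invisible to the workflow engine, so the clash on OCR-D-IMG-BIN could not be reported.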
This side issue has since resurfaced (again off-topic) in another issue, and it was decided to indeed change the specification so that all new derived images are placed in the output fileGrp (alongside the PAGE-XML).
However, the above original issue remains pressing: we have to reflect the problem of AlternativeImage coordinate consistency and its solution in core within the spec.
This would entail:
- stating the general problem briefly (to clearly motivate these strict and elaborate requirements), but detailing aspects like reshaping during rotation, center of rotation, multi-level rotation, splitting @orientation into reflection vs rotation
- upgrading the AlternativeImage/@comments classes ("image features") to mandatory
- extending them appropriately to all needed features
- explaining their interpretation in detail (including the difference between level-local features like deskewed and inherited features like binarized)
- adding the principle of appending to AlternativeImage (not replacing/inserting), and appending its @comments (not starting empty but keeping all features of the image data it was derived from and then adding the new features)
- addressing/discussing the open problems of down/up-scaling and dewarping
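The appending principle from the list above can be sketched as follows. This is only an illustration under assumptions: the classes `AlternativeImage` and `Segment` are plain stand-ins (not the PAGE bindings from core), the helper `append_derived_image` and the filenames are hypothetical, and @comments is assumed to be a comma-separated list of feature names.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AlternativeImage:
    filename: str
    comments: str = ""  # assumed: comma-separated image features

@dataclass
class Segment:
    """Stand-in for a PAGE element (Page, Region, TextLine, ...)."""
    alternative_images: List[AlternativeImage] = field(default_factory=list)

def append_derived_image(segment, filename, source_features, new_features):
    """Append (never replace or insert) a derived image whose comments keep
    all features of the image it was derived from, plus the new ones."""
    features = [f for f in source_features.split(",") if f]
    for feature in new_features:
        if feature not in features:
            features.append(feature)
    image = AlternativeImage(filename, ",".join(features))
    segment.alternative_images.append(image)
    return image

page = Segment()
cropped = append_derived_image(page, "page_crop.png", "", ["cropped"])
binarized = append_derived_image(
    page, "page_crop_bin.png", cropped.comments, ["binarized"])
# binarized.comments is now "cropped,binarized"; the cropped-only image
# is still present as the first entry.
```

Because earlier entries are never replaced, a consumer that wants the cropped but not binarized image can still find it.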
Yet another open problem has surfaced: When a processor changes coordinates of some existing segment, it must also remove all existing derived images for that segment, because they will be invalid. This includes the following cases:
- whenever overwriting a Page's Border, or a Border's Coords or Coords/@points, remove all the Page's derived images with cropped,
- whenever overwriting Region's or TextLine's or Word's or Glyph's Coords or Coords/@points, remove all its derived images,
- whenever overwriting Page's or Region's @orientation, remove all its derived images with deskewed.
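A minimal sketch of these three rules, under stated assumptions: derived images are represented as plain dicts, @comments is assumed to be comma-separated, and the function `remaining_after_overwrite` with its `changed` keywords is hypothetical, not something core provides.

```python
def has_feature(comments, feature):
    """True if the comma-separated comments string contains the feature."""
    return feature in [f for f in comments.split(",") if f]

def remaining_after_overwrite(images, changed):
    """Return the derived images that stay valid after an overwrite.

    images:  list of dicts with 'filename' and 'comments'
    changed: 'border'      (Page: Border or its Coords changed)
             'coords'      (Region/TextLine/Word/Glyph: Coords changed)
             'orientation' (Page/Region: @orientation changed)
    """
    if changed == "coords":
        return []  # every derived image of that segment becomes invalid
    feature = {"border": "cropped", "orientation": "deskewed"}[changed]
    return [img for img in images if not has_feature(img["comments"], feature)]

page_images = [
    {"filename": "bin.png", "comments": "binarized"},
    {"filename": "bin_crop.png", "comments": "binarized,cropped"},
    {"filename": "bin_crop_deskew.png", "comments": "binarized,cropped,deskewed"},
]
after_recrop = remaining_after_overwrite(page_images, "border")
after_reorient = remaining_after_overwrite(page_images, "orientation")
```

Here re-cropping leaves only the uncropped binarized image, while changing the orientation also keeps the cropped one.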
But this may also remove the only result of some previous workflow step (like binarization or denoising). So workflows would need to redo those steps afterwards, and workflow writers must be aware of this. To that end, consuming processors should ideally fail immediately when they cannot find the derived images they expected. But that only happens when the implementors choose to use image feature selection/filtering (which is still optional).
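What failing immediately could look like on the consumer side is sketched below. The function `select_image` and its parameters are hypothetical and only illustrate feature selection/filtering; it assumes comma-separated @comments and that the most recently appended image should win.

```python
def select_image(images, feature_selector="", feature_filter=""):
    """Return the newest image that has all selected features and none of
    the filtered ones; raise LookupError if there is no such image."""
    required = {f for f in feature_selector.split(",") if f}
    forbidden = {f for f in feature_filter.split(",") if f}
    for img in reversed(images):  # assumed: later entries are newer
        features = {f for f in img["comments"].split(",") if f}
        if required <= features and not (forbidden & features):
            return img
    raise LookupError(
        "no derived image with features %r (excluding %r)"
        % (sorted(required), sorted(forbidden)))

# After re-cropping, an earlier binarization result may be gone:
images = [{"filename": "page_crop.png", "comments": "cropped"}]
try:
    select_image(images, feature_selector="binarized")
    failed = False
except LookupError:
    failed = True  # the processor stops here instead of silently falling back
```

Without a selector the same call would simply return the cropped image, and the missing binarization would go unnoticed.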
For modules like ocrd_tesserocr it's not easy to decide which image features should be present. Some workflows might want to include a binarization step, while others might want to use Tesseract's internal binarization. Also, at the processor level, it cannot be decided whether denoised is a required feature. (But still, if the workflow included a denoising step earlier it would be quite surprising if no denoised images are available after re-cropping, for example.)