Handle ALTO XML with different POINTS representation

Open stweil opened this issue 1 year ago • 3 comments

ALTO XML from ABBYY FineReader separates x and y coordinates with a comma:

<Polygon POINTS="159,837 2414,837 2414,1038 159,1038 159,837"/></Shape>

Replace commas with spaces to support that, too.
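
For illustration, the normalization could look roughly like the sketch below. The function name `parse_points` is hypothetical and is not taken from kraken's code; it only shows that turning commas into spaces lets the same split-and-pair logic handle both the ABBYY form and a purely space-separated form.

```python
def parse_points(points: str) -> list[tuple[int, int]]:
    """Parse an ALTO POINTS string into (x, y) tuples.

    Hypothetical helper: assumes integer coordinates and an even
    number of values.
    """
    # Treat commas as plain whitespace so that "x,y x,y" and "x y x y"
    # tokenize identically.
    values = [int(v) for v in points.replace(",", " ").split()]
    # Pair consecutive values: even indices are x, odd indices are y.
    return list(zip(values[::2], values[1::2]))


abbyy = parse_points("159,837 2414,837 2414,1038 159,1038 159,837")
spaces = parse_points("159 837 2414 837 2414 1038 159 1038 159 837")
```

Both calls yield the same five points for the polygon shown above.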

Signed-off-by: Stefan Weil [email protected]

stweil, Sep 06 '22 14:09

See also https://github.com/altoxml/schema/issues/49.

stweil, Sep 06 '22 14:09

ABBYY-generated ALTO files have neither polygons nor baselines for text lines by default, but reasonable values for both can be derived from HPOS, VPOS, WIDTH and HEIGHT. The resulting code seems to work fine for "upgrading" ABBYY OCR results with kraken. Would such changes be useful for the regular code, too? If so, I can send a pull request.
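
A minimal sketch of one way to derive both values from the bounding box follows. It is not the code referred to above: the function name is hypothetical, and placing the baseline on the bottom edge of the box is an assumption made for this example (it ignores descenders).

```python
def box_to_polygon_and_baseline(hpos, vpos, width, height):
    """Derive a rectangular polygon and a flat baseline from an ALTO box.

    Hypothetical helper. Assumption: the baseline runs along the bottom
    edge of the box, which is only a rough approximation.
    """
    left, top = hpos, vpos
    right, bottom = hpos + width, vpos + height
    # Corners in clockwise order, starting at the top-left.
    polygon = [(left, top), (right, top), (right, bottom), (left, bottom)]
    # Baseline from the bottom-left to the bottom-right corner.
    baseline = [(left, bottom), (right, bottom)]
    return polygon, baseline


# A box with HPOS=159, VPOS=837, WIDTH=2255, HEIGHT=201 covers the same
# area as the polygon quoted at the top of this thread.
polygon, baseline = box_to_polygon_and_baseline(159, 837, 2255, 201)
```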

stweil, Sep 06 '22 14:09

As ALTO doesn't define valid and invalid representations, I'd prefer having something that is able to parse all 4 examples proposed in the ticket. AFAIK PageXML suffers from the same problem, so it's probably best to have a separate function that extracts Cartesian coordinate sequences from strings.
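
Such a function might be sketched as below. The name `extract_coordinates` is hypothetical, and the four examples from the ticket are not reproduced here; the sketch simply assumes that values are plain decimal numbers separated by any mix of commas and whitespace, and pairs them up in order.

```python
import re

# Matches an optionally signed integer or decimal number.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_coordinates(text: str) -> list[tuple[float, float]]:
    """Extract (x, y) pairs from a coordinate string.

    Hypothetical helper: every non-numeric character is treated as a
    separator, so the grouping of values in the input is ignored.
    """
    values = [float(v) for v in _NUMBER.findall(text)]
    if len(values) % 2:
        raise ValueError(f"odd number of coordinate values in {text!r}")
    # Pair consecutive values: even indices are x, odd indices are y.
    return list(zip(values[::2], values[1::2]))


mixed = extract_coordinates("159,837 2414,837")
flat = extract_coordinates("159 837 2414 837")
commas = extract_coordinates("159,837,2414,837")
```

Because grouping is ignored, this would accept malformed input such as "1,2,3 4" without complaint; a stricter parser would need to validate the separators per format.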

mittagessen, Sep 28 '22 18:09