kraken
Handle ALTO XML with different POINTS representation
ALTO XML from ABBYY FineReader separates x and y coordinates with a comma:
<Polygon POINTS="159,837 2414,837 2414,1038 159,1038 159,837"/></Shape>
Replace commas with spaces to support that, too.
Signed-off-by: Stefan Weil [email protected]
See also https://github.com/altoxml/schema/issues/49.
ABBYY-generated ALTO files have neither polygons nor baselines for text lines by default, but reasonable values for both can be derived from HPOS, VPOS, WIDTH and HEIGHT. The resulting code seems to work fine for "upgrading" ABBYY OCR results with kraken. Would such changes be useful for the regular code, too? Then I can send a pull request.
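To illustrate the idea, here is a minimal sketch of deriving a polygon and a baseline from the four box attributes. The function name is hypothetical and not part of kraken, and placing the baseline on the bottom edge of the box is an assumption made for illustration; the actual code may pick a different position.

```python
def box_to_polygon_and_baseline(hpos, vpos, width, height):
    """Derive a rectangular polygon and a straight baseline from an ALTO box.

    Hypothetical helper, not part of kraken. The baseline is assumed to
    run along the bottom edge of the box.
    """
    left, top = round(hpos), round(vpos)
    right, bottom = round(hpos + width), round(vpos + height)
    # Corners in clockwise order, starting at the top-left corner.
    polygon = [(left, top), (right, top), (right, bottom), (left, bottom)]
    # Assumption: baseline from bottom-left to bottom-right.
    baseline = [(left, bottom), (right, bottom)]
    return polygon, baseline


polygon, baseline = box_to_polygon_and_baseline(159, 837, 2255, 201)
print(polygon)
print(baseline)
```

With HPOS=159, VPOS=837, WIDTH=2255 and HEIGHT=201 this yields the same corner points as the Polygon example in the commit message.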
As ALTO doesn't define which representations are valid and which are invalid, I'd prefer something that can parse all four examples proposed in the ticket. AFAIK PageXML suffers from the same problem, so it's probably best to have a separate function that extracts Cartesian coordinate sequences from strings.
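A rough sketch of what such a function could look like. The name `parse_points` is hypothetical, and the sketch assumes that every variant boils down to numbers separated by commas and/or whitespace; the exact four forms from the ticket are not reproduced here.

```python
import re


def parse_points(points):
    """Extract (x, y) pairs from a coordinate string.

    Hypothetical helper, not part of kraken. Assumes the values are
    separated by any mix of commas and whitespace, e.g.
    "x1,y1 x2,y2", "x1 y1 x2 y2" or "x1,y1,x2,y2".
    """
    # Split on runs of commas and whitespace, dropping empty fragments.
    values = [float(v) for v in re.split(r"[\s,]+", points.strip()) if v]
    if len(values) % 2:
        raise ValueError(f"odd number of coordinate values in {points!r}")
    # Pair up consecutive values as (x, y) tuples.
    return list(zip(values[::2], values[1::2]))


print(parse_points("159,837 2414,837 2414,1038 159,1038 159,837"))
print(parse_points("159 837 2414 837"))
```

Treating commas and whitespace as interchangeable separators keeps the parser independent of which tool wrote the file, at the cost of not detecting strings where the separators were meant to carry structure.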