Mike Gerber
ocrd_calamari (but AFAIK *not* Calamari itself yet) has been able to produce word- and glyph-level segmentation for about a year now; it just does not do so by default. Sorry I didn't speak up...
> Indeed, PAGE-ALTO conversion requires word segmentation. I wasn't aware of that until now, good to know! And good it's already in ocrd_calamari, albeit originally for an entirely different reason....
What prima-page-converter/ocr-fileformat could do, as far as I can tell from this issue: Give a user-friendly warning that there are no words in the PAGE document, so that ALTO conversion...
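To illustrate the kind of check meant here, a minimal sketch in Python, not how prima-page-converter or ocr-fileformat actually work. The function names `count_words` and `warn_if_no_words`, the warning text, and the sample document are all hypothetical. Element names are matched without their namespace because PAGE files in the wild use several schema versions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a PAGE document with a text line but no Word elements.
SAMPLE_NO_WORDS = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1">
        <TextEquiv><Unicode>no word segmentation here</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_words(page_xml):
    """Count Word elements in a PAGE XML string, whatever the schema version."""
    root = ET.fromstring(page_xml)
    return sum(
        1
        for el in root.iter()
        if isinstance(el.tag, str) and local_name(el.tag) == "Word"
    )


def warn_if_no_words(page_xml):
    """Return a warning message if the document has no words, else None."""
    if count_words(page_xml) == 0:
        return (
            "Warning: no Word elements found in the PAGE document; "
            "the ALTO output will not contain any text."
        )
    return None


if __name__ == "__main__":
    message = warn_if_no_words(SAMPLE_NO_WORDS)
    if message:
        print(message)
```

A converter could print such a message before writing the ALTO file, instead of silently producing an ALTO document without text.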
I would consider this a serious bug, not an enhancement.
This is not just an imperfection from not supporting some features: if it does not honor the reading order, it produces a **wrong result** for a lot of real-world PAGE XML files.
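To make "honoring the reading order" concrete, here is a rough sketch in Python, not taken from any of the converters discussed. It assumes the simple case of a single flat `OrderedGroup` of `RegionRefIndexed` entries under `ReadingOrder`; nested or unordered groups are not handled. The function name `region_ids_in_reading_order` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: document order is r1, r2, but the reading order is r2, r1.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.png" imageWidth="100" imageHeight="100">
    <ReadingOrder>
      <OrderedGroup id="g1">
        <RegionRefIndexed index="0" regionRef="r2"/>
        <RegionRefIndexed index="1" regionRef="r1"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1"/>
    <TextRegion id="r2"/>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def region_ids_in_reading_order(page_xml):
    """Return TextRegion ids in reading order.

    Assumption: at most one flat OrderedGroup. If there is none, fall back
    to document order.
    """
    root = ET.fromstring(page_xml)
    elements = [el for el in root.iter() if isinstance(el.tag, str)]
    group = next((el for el in elements if local_name(el.tag) == "OrderedGroup"), None)
    if group is None:
        # No explicit reading order: use the order of regions in the file.
        return [el.get("id") for el in elements if local_name(el.tag) == "TextRegion"]
    # Only direct children are considered, sorted by their index attribute.
    refs = [el for el in group if local_name(el.tag) == "RegionRefIndexed"]
    refs.sort(key=lambda el: int(el.get("index")))
    return [el.get("regionRef") for el in refs]


if __name__ == "__main__":
    print(region_ids_in_reading_order(SAMPLE))
```

A converter that walks the regions in document order instead would emit r1 before r2 for this sample, which is exactly the kind of wrong result described above.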
The file in https://github.com/UB-Mannheim/ocr-fileformat/issues/138#issue-895785528 was created (by an SBB contractor) using Aletheia and uses their encoding scheme, which uses a lot of PUA characters, which in part is based on...
> How does that compare with https://github.com/PRImA-Research-Lab/prima-page-converter

@maxnth I had problems with prima-page-converter (going to open a bug report), while [Mewel/abbyy-to-alto](https://github.com/Mewel/abbyy-to-alto) worked right away.
> I had problems with prima-page-converter (going to open a bug report),

https://github.com/PRImA-Research-Lab/prima-page-viewer/issues/24 - I opened the issue against prima-page-viewer as it is affected, too.
> while [Mewel/abbyy-to-alto](https://github.com/Mewel/abbyy-to-alto) worked right away.

Sort of - it does not produce `Processing` tags (or the ALTO v2 equivalent), so it is lacking too.
> > > There is also a newer implementation with Java (+Maven): https://github.com/Mewel/abbyy-to-alto
> >
> > That source code includes at least one copyrighted xsl file.
>
> It does? I only...