Mike Gerber

74 comments by Mike Gerber

ocrd_calamari (but AFAIK *not* Calamari yet) has been able to produce word- and glyph-level segmentation for a year now; it just does not do so by default. Sorry I didn't speak up...

> Indeed, PAGE-ALTO conversion requires word segmentation. I wasn't aware of that until now, good to know! And good it's already in ocrd_calamari, albeit originally for an entirely different reason....

What prima-page-converter/ocr-fileformat could do, as far as I can tell from this issue: Give a user-friendly warning that there are no words in the PAGE document, so that ALTO conversion...
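A minimal sketch of what such a check could look like, in Python. This is not code from prima-page-converter or ocr-fileformat; the function names `count_words` and `warn_if_no_words` are hypothetical, and the only assumption about the input is that PAGE marks words as `Word` elements.

```python
import warnings
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace so that any PAGE schema version matches."""
    return tag.rsplit("}", 1)[-1]


def count_words(page_xml):
    """Count the Word elements in a PAGE XML string (hypothetical helper)."""
    root = ET.fromstring(page_xml)
    return sum(1 for el in root.iter() if local_name(el.tag) == "Word")


def warn_if_no_words(page_xml):
    """Return True if the document has words; otherwise warn and return False."""
    if count_words(page_xml) == 0:
        warnings.warn(
            "PAGE document contains no Word elements; "
            "PAGE-to-ALTO conversion needs word segmentation."
        )
        return False
    return True
```

Matching on the local tag name instead of a fixed namespace keeps the check independent of the PAGE schema version used by the file.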

I would consider this a serious bug, not an enhancement.

This is not a mere imperfection from leaving some features unsupported: if it does not honor the reading order, it produces a **wrong result** for a lot of real-world PAGE XML files.
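To illustrate what honoring the reading order means, here is a rough Python sketch. It is not the converter's code. It assumes a single flat `OrderedGroup` of `RegionRefIndexed` entries under `ReadingOrder`; nested or unordered groups are not handled, and the function name `regions_in_reading_order` is made up for this example.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace so that any PAGE schema version matches."""
    return tag.rsplit("}", 1)[-1]


def regions_in_reading_order(page_xml):
    """Return TextRegion ids sorted by ReadingOrder, else in document order."""
    root = ET.fromstring(page_xml)
    doc_order = [
        el.get("id") for el in root.iter() if local_name(el.tag) == "TextRegion"
    ]
    refs = [el for el in root.iter() if local_name(el.tag) == "RegionRefIndexed"]
    if not refs:
        # No reading order given: document order is all we have.
        return doc_order
    # Assumption: one flat OrderedGroup, so the indices are unique.
    refs.sort(key=lambda el: int(el.get("index")))
    ordered = [el.get("regionRef") for el in refs]
    # Regions missing from the reading order are appended in document order.
    return ordered + [rid for rid in doc_order if rid not in ordered]
```

A converter that simply walks the regions in document order would emit `r1` before `r2` even when the reading order says otherwise, which is the wrong result described above.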

The file in https://github.com/UB-Mannheim/ocr-fileformat/issues/138#issue-895785528 was created (by an SBB contractor) using Aletheia and uses their encoding scheme, which relies on a lot of PUA characters and is in part based on...

> How does that compare with https://github.com/PRImA-Research-Lab/prima-page-converter

@maxnth I had problems with prima-page-converter (going to open a bug report), while [Mewel/abbyy-to-alto](https://github.com/Mewel/abbyy-to-alto) worked right away.

> I had problems with prima-page-converter (going to open a bug report),

https://github.com/PRImA-Research-Lab/prima-page-viewer/issues/24 - I opened the issue against prima-page-viewer as it is affected, too.

> while [Mewel/abbyy-to-alto](https://github.com/Mewel/abbyy-to-alto) worked right away.

Sort of - it does not produce `Processing` tags (or the ALTO v2 equivalent), so it is lacking too.

> > > There is also a newer implementation with Java (+Maven): https://github.com/Mewel/abbyy-to-alto
>
> > That source code includes at least one copyrighted xsl file.
>
> It does? I only...