Mike Gerber comments


"ocr-transform page alto ... ...": loosing text

ocrd_calamari (but AFAIK *not* Calamari itself yet) has been able to produce word- and glyph-level segmentation for about a year now; it just does not do so by default. Sorry I didn't speak up...

"ocr-transform page alto ... ...": loosing text

> Indeed, PAGE-ALTO conversion requires word segmentation. I wasn't aware of that until now, good to know! And good it's already in ocrd_calamari, albeit originally for an entirely different reason....

"ocr-transform page alto ... ...": loosing text

What prima-page-converter/ocr-fileformat could do, as far as I can tell from this issue: Give a user-friendly warning that there are no words in the PAGE document, so that ALTO conversion...
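A minimal sketch of such a check, not taken from prima-page-converter or ocr-fileformat: it assumes the warning can be decided by counting `TextLine` and `Word` elements in the PAGE document. The function names `count_elements` and `alto_conversion_warning` and the message wording are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def count_elements(page_xml: str, name: str) -> int:
    """Count elements with the given local name, ignoring the PAGE namespace version."""
    root = ET.fromstring(page_xml)
    return sum(
        1 for el in root.iter()
        if isinstance(el.tag, str) and local_name(el.tag) == name
    )


def alto_conversion_warning(page_xml: str) -> Optional[str]:
    """Return a warning if the document has text lines but no words, else None."""
    lines = count_elements(page_xml, "TextLine")
    words = count_elements(page_xml, "Word")
    if lines > 0 and words == 0:
        return (
            f"PAGE document has {lines} text line(s) but no Word elements; "
            "ALTO conversion may lose text."
        )
    return None


# Illustrative input: one text line without word segmentation.
SAMPLE_NO_WORDS = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1">
        <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

# Same line, but with a Word element, so no warning is expected.
SAMPLE_WITH_WORDS = SAMPLE_NO_WORDS.replace(
    "<TextEquiv>", '<Word id="w1"/><TextEquiv>', 1
)

if __name__ == "__main__":
    print(alto_conversion_warning(SAMPLE_NO_WORDS))
```

Matching on local names keeps the check independent of which PAGE schema version the file declares.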

page__text.xsl is not honoring the reading order

I would consider this a serious bug, not an enhancement.

page__text.xsl is not honoring the reading order

This is not a mere imperfection from lacking support for some features: by not honoring the reading order, it produces a **wrong result** for a lot of real-world PAGE XML files.
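To illustrate why document order and reading order can differ, here is a rough Python sketch (not the XSLT in question). It assumes a flat `ReadingOrder`/`OrderedGroup` made of `RegionRefIndexed` entries and region-level `TextEquiv` text; nested groups and unordered groups are ignored. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def region_text(region: ET.Element) -> str:
    """Return the region-level TextEquiv/Unicode text (direct children only)."""
    for child in region:
        if local_name(child.tag) == "TextEquiv":
            for sub in child:
                if local_name(sub.tag) == "Unicode":
                    return sub.text or ""
    return ""


def text_in_reading_order(page_xml: str) -> List[str]:
    """Return region texts sorted by RegionRefIndexed/@index.

    Falls back to document order if no reading order is present.
    Regions not referenced by the reading order are skipped in this sketch.
    """
    root = ET.fromstring(page_xml)
    regions = {
        el.get("id"): el
        for el in root.iter()
        if local_name(el.tag) == "TextRegion"
    }
    refs = [el for el in root.iter() if local_name(el.tag) == "RegionRefIndexed"]
    refs.sort(key=lambda el: int(el.get("index")))
    ordered_ids = [el.get("regionRef") for el in refs] or list(regions)
    return [region_text(regions[i]) for i in ordered_ids if i in regions]


# Illustrative input: r1 comes first in the file but second in reading order.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <ReadingOrder>
      <OrderedGroup id="g1">
        <RegionRefIndexed index="0" regionRef="r2"/>
        <RegionRefIndexed index="1" regionRef="r1"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1">
      <TextEquiv><Unicode>second paragraph</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2">
      <TextEquiv><Unicode>first paragraph</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    print("\n".join(text_in_reading_order(SAMPLE)))
```

A converter that simply walks the `TextRegion` elements in file order would emit the two paragraphs of this sample in the wrong sequence.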

page__text.xsl is not honoring the reading order

The file in https://github.com/UB-Mannheim/ocr-fileformat/issues/138#issue-895785528 was created (by an SBB contractor) using Aletheia and follows their encoding scheme, which uses a lot of PUA characters and is based in part on...

ABBYY2Alto

> How does that compare with https://github.com/PRImA-Research-Lab/prima-page-converter

@maxnth I had problems with prima-page-converter (going to open a bug report), while [Mewel/abbyy-to-alto](https://github.com/Mewel/abbyy-to-alto) worked right away.