Konstantin Baierer comments

Results: 297 comments of Konstantin Baierer

alto2hocr: Content in BottomMargin is not considered (PrintSpace node is missing in this example)

> I am not aware of such examples, but maybe @cneud can help with this?

* https://github.com/kba/ocr-fileformat-samples/blob/master/samples/alto/2.0/417576986_0078.alto
* https://github.com/kba/ocr-fileformat-samples/blob/master/samples/alto/2.0/417576986_0012.alto

(ht @bertsky in gitter)

alto2hocr: Content in BottomMargin is not considered (PrintSpace node is missing in this example)

> @kba I do not see any content in the margin elements - there will be no output produced by the transformation.

Ah, I didn't realize. Maybe @cneud has ALTO...

Cannot convert hOCR with xhtml namespace to ALTO 2.1

The XSLT scripts match on local names only, without namespaces, cf. https://github.com/filak/hOCR-to-ALTO/blob/master/hocr2alto2.1.xsl. I think I ran into this before: https://github.com/filak/hOCR-to-ALTO/commit/9f8026cd2b61bd842aa40dff5598f2d0bbd19b07
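To illustrate the general mismatch described in the comment above, here is a minimal Python sketch, not taken from the converter. It assumes an hOCR document that declares the XHTML namespace and shows that a lookup by plain element name finds nothing until the namespace is removed. The helper `strip_namespace` and the sample markup are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample: a tiny hOCR-like document in the XHTML namespace.
SAMPLE_HOCR = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><div class="ocr_page"/></body>'
    '</html>'
)


def strip_namespace(root):
    """Hypothetical helper: drop the '{namespace}' prefix from every tag in place."""
    for element in root.iter():
        if isinstance(element.tag, str) and element.tag.startswith("{"):
            element.tag = element.tag.split("}", 1)[1]
    return root


root = ET.fromstring(SAMPLE_HOCR)

# A non-namespaced lookup does not see the namespaced <body>.
plain_before = root.find("body")
# The same element is found when the namespace is spelled out.
qualified_before = root.find("{%s}body" % XHTML_NS)

strip_namespace(root)
# After stripping, the plain name matches.
plain_after = root.find("body")

print(plain_before is None, qualified_before is not None, plain_after is not None)
```

Stripping the namespace before conversion is only one possible workaround; making the stylesheet itself namespace-aware would be the alternative.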

PAGE format extension in Transkribus

Yes, something like `ocr-transform transkribus-page page2019` for ocr{d_,-}fileformat will be the next step to properly integrate this into OCR-D.

Support DjVu format?

IIUC, @jwilk has done a lot of work on OCR with DjVu for digitization in Poland. Not sure how widely DjVu is used, it's certainly an interesting and fitting format...

ABBYY2Alto

Yes, I've seen it but I very much prefer a declarative transformation in XSLT that has no possible side effects and is easier to test. Maybe we can convert it...

ABBYY2Alto

How does that compare with https://github.com/PRImA-Research-Lab/prima-page-converter, @maxnth?

ABBYY2Alto

> > There is also a newer implementation with Java (+Maven): https://github.com/Mewel/abbyy-to-alto
>
> > That source code includes at least one copyrighted xsl file.

It does? I only saw that...

Simplify validations

> whether it is afterwards still useful to have the option to specify the exact stylesheet instead of simply any PAGE version

I would leave that option and optionally automate....

Order of regions

> I double-checked and you are right: the output of segmentation and recognition are showing the right order but "fileformat-transform" extracts the paragraphs in wrong order.
> I believe...