Konstantin Baierer

Results: 297 comments by Konstantin Baierer

> I am not aware of such examples, but maybe @cneud can help with this? * https://github.com/kba/ocr-fileformat-samples/blob/master/samples/alto/2.0/417576986_0078.alto * https://github.com/kba/ocr-fileformat-samples/blob/master/samples/alto/2.0/417576986_0012.alto (ht @bertsky in gitter)

> @kba I do not see any content in the margin elements - there will be no output produced by the transformation. Ah, I didn't realize. Maybe @cneud has ALTO...

The XSLT scripts match on local-name only, i.e. they are not namespace-aware, cf. https://github.com/filak/hOCR-to-ALTO/blob/master/hocr2alto2.1.xsl. I think I ran into this before: https://github.com/filak/hOCR-to-ALTO/commit/9f8026cd2b61bd842aa40dff5598f2d0bbd19b07
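To illustrate what matching on local-name only means in practice, here is a minimal sketch in Python (not XSLT, and not taken from the linked stylesheets). The two inline documents and the helper names are made up for the example; only the two ALTO namespace URIs are real. Matching by local name finds `String` elements regardless of the ALTO version, while a namespace-qualified match silently returns nothing when the version differs.

```python
import xml.etree.ElementTree as ET

ALTO_V2_NS = "http://www.loc.gov/standards/alto/ns-v2#"
ALTO_V3_NS = "http://www.loc.gov/standards/alto/ns-v3#"

# Hypothetical minimal documents that differ only in their namespace.
ALTO_V2 = (
    f'<alto xmlns="{ALTO_V2_NS}"><Layout><Page>'
    '<String CONTENT="foo"/></Page></Layout></alto>'
)
ALTO_V3 = (
    f'<alto xmlns="{ALTO_V3_NS}"><Layout><Page>'
    '<String CONTENT="bar"/></Page></Layout></alto>'
)


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def strings_by_local_name(xml_text):
    """Collect String/@CONTENT while ignoring the namespace entirely."""
    root = ET.fromstring(xml_text)
    return [
        el.get("CONTENT")
        for el in root.iter()
        if local_name(el.tag) == "String"
    ]


def strings_by_qualified_name(xml_text, namespace):
    """Collect String/@CONTENT only for elements in the given namespace."""
    root = ET.fromstring(xml_text)
    return [el.get("CONTENT") for el in root.iter(f"{{{namespace}}}String")]
```

The trade-off is the usual one: local-name matching tolerates any ALTO version, but it would also match a `String` element from an unrelated namespace.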

Yes, something like `ocr-transform transkribus-page page2019` for ocr{d_,-}fileformat will be the next step to properly integrate this into OCR-D.

@jwilk has done a lot of work on OCR with DjVu for digitization in Poland, IIUC. Not sure how widely DjVu is used, but it's certainly an interesting and fitting format...

Yes, I've seen it, but I very much prefer a declarative transformation in XSLT that has no possible side effects and is easier to test. Maybe we can convert it...

How does that compare with https://github.com/PRImA-Research-Lab/prima-page-converter @maxnth?

> > There is also a newer implementation with Java (+Maven): https://github.com/Mewel/abbyy-to-alto > > That source code includes at least one copyrighted xsl file. It does? I only saw that...

> whether it is afterwards still useful to have the option to specify the exact stylesheet instead of simply any PAGE version I would leave that option and optionally automate....

> I double-checked and you are right: the output of segmentation and recognition shows the right order, but "fileformat-transform" extracts the paragraphs in the wrong order. > I believe...