eynollah
Order of regions
The order of the text regions detected by eynollah is not right. When running eynollah-segment on the attached image, the text regions are presented in the wrong order.
The workflow used is:
"olena-binarize -I OCR-D-IMG -O OCR-D-BIN"
"eynollah-segment -I OCR-D-BIN -O OCR-D-SEG -P models default -P curved_line true"
"tesserocr-recognize -I OCR-D-SEG -O OCR-D-OCR-TESSEROCR -P model ecco"
'fileformat-transform -I OCR-D-OCR-TESSEROCR -O OCR-D-TEXT -P from-to "page text"'
Dear @AriVesalainen,
Based on your result, this is the reading order. However, I could not find any mistake in it. Can you explain in detail what is wrong with the reading order?
I double-checked and you are right: the output of segmentation and recognition shows the right order, but "fileformat-transform" extracts the paragraphs in the wrong order.
I believe this is due to #22 – so in essence, the representation in eynollah is consistent with PageViewer, but wrong w.r.t. PAGE-XML (and thus also XSL transformations).
ocrd_fileformat should now use https://github.com/kba/page-to-alto for the PAGE transformation, not XSLT, and should respect the PAGE-XML reading order. I think we need a new release of ocrd_fileformat and ocrd_all.
Ah, sorry, I was not aware of that. But still: outside of the PRImA core libs, PageViewer, PageConverter and eynollah, we have to stick with the PAGE-XML spec, which requires using @index instead of XML ordering. And that's also what OCR-D, and thus page-to-alto, does.
So IMO this is still a duplicate of #22. (IIRC the actual blocker is that we have no response on https://github.com/PRImA-Research-Lab/prima-core-libs/issues/13 yet.)
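To illustrate the difference between XML document order and @index order, here is a minimal sketch in Python. It is not taken from OCR-D or page-to-alto. The sample document is a stripped-down, hypothetical PAGE-XML file (no Coords, a single flat OrderedGroup), and the helper names `regions_in_reading_order` and `region_text` are made up for this example. Nested groups are not handled.

```python
import xml.etree.ElementTree as ET

# PAGE-XML 2019 namespace; other schema versions use a different date suffix.
PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical, simplified page: the regions appear in the "wrong" document
# order, but the ReadingOrder indices say r_a comes before r_b.
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.png" imageWidth="100" imageHeight="100">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="1" regionRef="r_b"/>
        <RegionRefIndexed index="0" regionRef="r_a"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r_b"><TextEquiv><Unicode>second</Unicode></TextEquiv></TextRegion>
    <TextRegion id="r_a"><TextEquiv><Unicode>first</Unicode></TextEquiv></TextRegion>
  </Page>
</PcGts>
"""


def regions_in_reading_order(root):
    """Return TextRegion elements sorted by the @index of their RegionRefIndexed."""
    refs = root.findall(
        ".//pc:ReadingOrder/pc:OrderedGroup/pc:RegionRefIndexed", PAGE_NS
    )
    # Sort numerically on @index, ignoring the order of the refs in the file.
    refs.sort(key=lambda ref: int(ref.get("index")))
    by_id = {
        region.get("id"): region
        for region in root.findall(".//pc:TextRegion", PAGE_NS)
    }
    return [by_id[ref.get("regionRef")] for ref in refs if ref.get("regionRef") in by_id]


def region_text(region):
    """Return the region-level text, or an empty string if there is none."""
    unicode_el = region.find("pc:TextEquiv/pc:Unicode", PAGE_NS)
    return unicode_el.text or "" if unicode_el is not None else ""


root = ET.fromstring(SAMPLE)
document_order = [region_text(r) for r in root.findall(".//pc:TextRegion", PAGE_NS)]
reading_order = [region_text(r) for r in regions_in_reading_order(root)]
print(document_order)  # ['second', 'first']
print(reading_order)   # ['first', 'second']
```

A consumer that simply walks the regions in document order, as a plain XSLT template would, gets the first result. A consumer that follows the spec gets the second.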
I was not aware that eynollah has already fixed #22 in the meantime by sorting on @index before serialization. (This is enough to make both PageViewer and OCR-D happy.)
Also, @kba, I misread your "should" as meaning that you believed it already works that way, rather than as a call for action to make ocrd_fileformat start using page-to-alto (which I fully support in light of this, as implementing @index sorting would be hard to do with XSLT).
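For reference, sorting on @index before serialization could look roughly like the following Python sketch. This is not eynollah's actual code, only an illustration of the idea. The function name `sort_regions_by_index` and the sample page are hypothetical, and it assumes a single flat OrderedGroup whose refs point to direct children of Page.

```python
import xml.etree.ElementTree as ET

NS_URI = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": NS_URI}

# Hypothetical, simplified page with regions serialized out of reading order.
UNSORTED = f"""
<PcGts xmlns="{NS_URI}">
  <Page imageFilename="page.png" imageWidth="100" imageHeight="100">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r_a"/>
        <RegionRefIndexed index="1" regionRef="r_b"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r_b"/>
    <TextRegion id="r_a"/>
  </Page>
</PcGts>
"""


def sort_regions_by_index(page):
    """Reorder the region children of `page` in place so that XML document
    order matches the @index order given in ReadingOrder."""
    refs = page.findall("pc:ReadingOrder/pc:OrderedGroup/pc:RegionRefIndexed", NS)
    rank = {ref.get("regionRef"): int(ref.get("index")) for ref in refs}
    children = list(page)
    regions = [child for child in children if child.get("id") in rank]
    if not regions:
        return
    # Re-insert the sorted regions starting where the first one used to be,
    # so elements such as ReadingOrder keep their position in front.
    first_pos = min(children.index(region) for region in regions)
    for region in regions:
        page.remove(region)
    regions.sort(key=lambda region: rank[region.get("id")])
    for offset, region in enumerate(regions):
        page.insert(first_pos + offset, region)


root = ET.fromstring(UNSORTED)
page = root.find("pc:Page", NS)
sort_regions_by_index(page)
print([child.get("id") for child in page.findall("pc:TextRegion", NS)])  # ['r_a', 'r_b']
```

With the elements serialized in this order, a tool that reads regions in document order and a tool that follows @index arrive at the same sequence.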