eynollah
reading order representation (XML order vs index)
The reading order detection capabilities of eynollah look pretty amazing AFAICS – if viewed through the glasses of PageViewer. But it is noteworthy that the actual representation for PAGE-XML does not correspond to that schema's documentation regarding @index.
It surprisingly turns out that PageViewer gets it wrong too. See here for full report.
So IMO eynollah needs to actually invert its representation: the (currently correct) XML ordering needs to also become the (currently broken) @index ordering.
Dear Robert, first let me thank you for your kind words. We already had the same issue with Page Viewer and another tool that @cneud used (I've forgotten the name). The point is that eynollah does find the reading order as Page Viewer shows it; or, better to say, regardless of viewers, we know the correct order of the text regions. So we could have a call (including @cneud, @kba and @mikegerber) to discuss this and see how we should write it into the output to get the desired results.
While I think "Position (order number) of this item within the current hierarchy level." (from PAGE-XML's schema) could be clearer, I, too, think the implementation in both PAGE Viewer and eynollah does not currently follow this spec.
In @bertsky's example (https://github.com/PRImA-Research-Lab/prima-core-libs/files/6046579/debug-readingorder.zip) the correct reading order is the same as the XML order, but the index attributes of the RegionRefIndexed elements are all over the place. (They are set in xml.py from counting the Regions in the order they appear in the result from some other process, I think, which is basically arbitrary. But I'm not 100% sure as the function surprisingly returns id_of_marginalia and I cannot make much sense of that. @vahidrezanezhad and @kba should probably write something about this because all this stuff is currently 🚧🚧🚧)
Anyway, the index values should be sorted the same way as the current XML order; then the output would be correct. And it would still display correctly in the current PAGE Viewer, even with its allegedly incorrect use of the reading order.
(Is there any good reason to order XML elements by a special index attribute instead of just using the XML order? For stream rewriting maybe?)
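To make the proposed fix concrete (index values that follow the current XML order), here is a minimal sketch using only Python's standard library. It is not eynollah's code: the function names `local_name` and `reindex_reading_order`, the sample document, and the region IDs are all made up for illustration, and starting the numbering at 0 is an assumption.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def reindex_reading_order(root):
    """Rewrite @index so that it matches the XML order inside each ordered group.

    Hypothetical helper, not part of eynollah. Numbering from 0 is an assumption.
    """
    for group in root.iter():
        if local_name(group.tag) not in ("OrderedGroup", "OrderedGroupIndexed"):
            continue
        # Only direct children that carry an index take part in the ordering.
        indexed = [child for child in group if "index" in child.attrib]
        for position, child in enumerate(indexed):
            child.set("index", str(position))
    return root


# Made-up sample: the XML order is right, the index values are scrambled.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="2" regionRef="r_a"/>
        <RegionRefIndexed index="0" regionRef="r_b"/>
        <RegionRefIndexed index="1" regionRef="r_c"/>
      </OrderedGroup>
    </ReadingOrder>
  </Page>
</PcGts>"""

root = reindex_reading_order(ET.fromstring(SAMPLE))
pairs = [
    (el.get("regionRef"), el.get("index"))
    for el in root.iter()
    if local_name(el.tag) == "RegionRefIndexed"
]
print(pairs)
```

After this pass, a consumer that sorts by @index and one that simply reads the elements in document order would arrive at the same sequence, which is the property asked for above.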
@vahidrezanezhad Any news on this bug?
not yet :(