Incorrect Reading Order in Single-page Image-Text Layouts
Bug
The page reading order is incorrect, especially in single-page documents laid out with images on the left and text on the right. When the document is converted to Markdown, the wrong text ends up under each image.
Steps to reproduce
- Upload a single-page document
- Use a page layout with an image on the left and text on the right
- Convert the document to Markdown format
- Check the output
Expected Behavior:
Reading order should be: Page Title => Image 1 => Section-Header 1 => List Items => Image 2
Actual Behavior:
Reading order is incorrect: Page Title => Image 1 => Image 2 => Section-Header 1
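To make the difference between the two orders concrete, here is a minimal sketch in plain Python. The block coordinates are hypothetical (they are not taken from the attached sample), and neither function is meant to describe what Docling does internally. `column_first` simply sorts by the left edge and then the top edge, which reproduces the actual order above. `row_band_order` groups vertically overlapping blocks into bands before reading left to right, which reproduces the expected order.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Block:
    label: str
    left: float
    top: float
    right: float
    bottom: float


# Hypothetical layout, origin at the top-left corner of the page:
# images in the left column, text in the right column.
BLOCKS = [
    Block("Page Title", 50, 20, 550, 60),
    Block("Image 1", 50, 80, 250, 280),
    Block("Section-Header 1", 270, 80, 550, 110),
    Block("List Items", 270, 120, 550, 280),
    Block("Image 2", 50, 300, 250, 500),
]


def column_first(blocks: List[Block]) -> List[Block]:
    """Read everything that starts further left first, then top to bottom."""
    return sorted(blocks, key=lambda b: (b.left, b.top))


def row_band_order(blocks: List[Block]) -> List[Block]:
    """Group vertically overlapping blocks into bands, then read each band
    left to right (and top to bottom within the same column)."""
    bands = []  # each entry is [band_bottom, [blocks in the band]]
    for block in sorted(blocks, key=lambda b: (b.top, b.left)):
        if bands and block.top < bands[-1][0]:
            # Block starts before the current band ends: same band.
            bands[-1][0] = max(bands[-1][0], block.bottom)
            bands[-1][1].append(block)
        else:
            bands.append([block.bottom, [block]])
    ordered: List[Block] = []
    for _, members in bands:
        ordered.extend(sorted(members, key=lambda b: (b.left, b.top)))
    return ordered


if __name__ == "__main__":
    print(" => ".join(b.label for b in column_first(BLOCKS)))
    print(" => ".join(b.label for b in row_band_order(BLOCKS)))
```

A band-based heuristic like this is only an illustration; it would break on layouts where a tall block in one column spans several unrelated rows in the other.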
Docling version
2.10.0
Python version
3.10
Sample layout
@Bariskau what input format were you using? Is this a native PowerPoint, a PDF, or something else? If you provide the source file, we can verify this more easily.
@cau-git I am using PDF format. I shared a sample layout as a PDF: sample-cpu.pdf. Thank you.
I am also having the same issue. Is there any solution to this ordering problem?
LayoutReader (LayoutLM) ordering is compatible with Docling. However, Docling has limitations in obtaining line height and width values. Because of this limitation, dividing the layout bounding boxes into random smaller bounding boxes and then ordering them with the model generally yields good results.
However, there are two significant issues with this approach:
- LayoutLMv3's license is not suitable for commercial use.
- Using a multimodal pre-trained model to build a model that only uses bounding box data is not an optimal approach.
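For anyone who wants to try the box-splitting workaround described above, here is a rough sketch of the bookkeeping in plain Python. It makes several assumptions: the sub-boxes are fixed-height horizontal strips (chosen here for determinism, in place of the random split mentioned above), `order_strips` is a hypothetical callable standing in for the ordering model, and `top_to_bottom_stub` is only a placeholder so the example runs without any model. None of these names come from Docling or LayoutReader.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def split_into_strips(box: Box, strip_height: float) -> List[Box]:
    """Cut one layout box into horizontal strips of at most strip_height."""
    if strip_height <= 0:
        raise ValueError("strip_height must be positive")
    left, top, right, bottom = box
    strips: List[Box] = []
    y = top
    while y < bottom:
        strips.append((left, y, right, min(y + strip_height, bottom)))
        y += strip_height
    # A zero-height box produces no strips; keep it as a single entry.
    return strips or [box]


def order_blocks_via_strips(
    boxes: Sequence[Box],
    order_strips: Callable[[List[Box]], List[int]],
    strip_height: float = 12.0,
) -> List[int]:
    """Return indices into `boxes` in reading order.

    Each box is split into strips, the strips are ranked by `order_strips`
    (hypothetical stand-in for the ordering model), and every box takes the
    position of its first-ranked strip.
    """
    strips: List[Box] = []
    owners: List[int] = []  # owners[i] is the index of the box strip i came from
    for box_index, box in enumerate(boxes):
        for strip in split_into_strips(box, strip_height):
            strips.append(strip)
            owners.append(box_index)

    ordered: List[int] = []
    seen = set()
    for strip_index in order_strips(strips):
        owner = owners[strip_index]
        if owner not in seen:
            seen.add(owner)
            ordered.append(owner)
    return ordered


def top_to_bottom_stub(strips: List[Box]) -> List[int]:
    """Placeholder ranking (top edge, then left edge); not a real model."""
    return sorted(range(len(strips)), key=lambda i: (strips[i][1], strips[i][0]))
```

In a real setup `order_strips` would wrap the model call, and the strip height would ideally approximate the line height that is hard to obtain directly.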
@Bariskau @mkhalid12 a revised reading order model is currently under development. We will post updates when we have them ready.
@Bariskau @mkhalid12 You can track this PR: https://github.com/DS4SD/docling/pull/811
@cau-git this is great news, looking forward to it.