Incorrect Reading Order in Single-page Image-Text Layouts

Open Bariskau opened this issue 1 year ago • 7 comments

Bug

The page reading order is incorrect. In single-page documents that place images on the left and text on the right, the elements are not read in the expected order. As a result, text is mapped to the wrong image when the document is converted to Markdown.

Steps to reproduce

  1. Upload a single-page document
  2. Use a page layout with an image on the left and text on the right
  3. Convert the document to Markdown format
  4. Check the output
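The steps above can be sketched as a short script. This is a minimal sketch, assuming Docling 2.x is installed and that the sample file is saved locally as `sample-cpu.pdf`; the helper name `convert_to_markdown` is hypothetical and only wraps the standard `DocumentConverter` call.

```python
def convert_to_markdown(pdf_path: str) -> str:
    """Convert a PDF with Docling and return the Markdown export."""
    # Imported inside the function so this file still loads
    # on machines where docling is not installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    return result.document.export_to_markdown()


if __name__ == "__main__":
    # Assumed local path to the sample attached later in this thread.
    print(convert_to_markdown("sample-cpu.pdf"))
```

Comparing the order of headings and image placeholders in the printed Markdown against the page shows whether the reading order is affected.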

Expected Behavior:

Reading order should be: Page Title => Image 1 => Section-Header 1 => List Items => Image 2

Actual Behavior:

Reading order is incorrect: Page Title => Image 1 => Image 2 => Section-Header 1
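To make the difference between the two orders concrete, here is a small self-contained sketch. It is not Docling's reading-order algorithm; the `Box` type, the coordinates, and both ordering functions are hypothetical, and a top-left page origin is assumed. A column-first sort reproduces the actual order, while grouping elements into horizontal bands before sorting left to right reproduces the expected one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    label: str
    left: float
    top: float
    right: float
    bottom: float


def column_first_order(boxes):
    """Read the left column top to bottom, then the right column."""
    return [b.label for b in sorted(boxes, key=lambda b: (b.left, b.top))]


def row_band_order(boxes):
    """Group vertically overlapping boxes into bands, then read each band left to right."""
    bands = []
    band_bottom = float("-inf")
    for box in sorted(boxes, key=lambda b: (b.top, b.left)):
        if not bands or box.top >= band_bottom:
            # The box starts below the current band, so open a new band.
            bands.append([box])
            band_bottom = box.bottom
        else:
            # The box overlaps the current band vertically, so it joins it.
            bands[-1].append(box)
            band_bottom = max(band_bottom, box.bottom)
    ordered = []
    for band in bands:
        ordered.extend(sorted(band, key=lambda b: (b.left, b.top)))
    return [b.label for b in ordered]


# Hypothetical layout: images on the left, text on the right (top-left origin).
LAYOUT = [
    Box("Page Title", 0, 0, 600, 50),
    Box("Image 1", 0, 60, 200, 260),
    Box("Section-Header 1", 220, 60, 600, 90),
    Box("List Items", 220, 100, 600, 250),
    Box("Image 2", 0, 280, 200, 480),
    Box("Section-Header 2", 220, 280, 600, 310),
]

print(column_first_order(LAYOUT))
# ['Page Title', 'Image 1', 'Image 2', 'Section-Header 1', 'List Items', 'Section-Header 2']
print(row_band_order(LAYOUT))
# ['Page Title', 'Image 1', 'Section-Header 1', 'List Items', 'Image 2', 'Section-Header 2']
```

The band heuristic only works when each image and its accompanying text share a vertical range, as in the sample layout; it is meant to illustrate the expected order, not to serve as a general fix.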

Docling version

2.10.0

Python version

3.10

Sample layout

(image: sample-layout)

Bariskau avatar Dec 11 '24 10:12 Bariskau

@Bariskau what is the input format you were using? Is this a native Powerpoint, a PDF, or something else? If you provide the source file we could verify more easily.

cau-git avatar Dec 18 '24 11:12 cau-git

@cau-git I am using PDF format. I have attached a sample layout as a PDF: sample-cpu.pdf. Thank you.

Bariskau avatar Dec 19 '24 01:12 Bariskau

I am also having the same issue. Is there any solution to this ordering problem?

mkhalid12 avatar Jan 04 '25 01:01 mkhalid12

LayoutReader (LayoutLM) ordering works with Docling. However, Docling is limited in the line height and width values it can provide. To work around this limitation, splitting each layout bounding box into smaller, randomly sized bounding boxes and then ordering those with the model generally gives good results.

However, there are two significant issues with this approach:

  1. LayoutLMv3's license is not suitable for commercial use.
  2. Using a multi-modal pre-trained model to develop a model that only uses bounding box data is not an optimal approach.
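The box-splitting workaround described above can be sketched as follows. This is only an illustration under stated assumptions: it cuts a box into uniform horizontal strips of an assumed line height instead of random sub-boxes, the function name `split_into_line_boxes` is hypothetical, and the call into LayoutReader itself is left out.

```python
import math


def split_into_line_boxes(box, line_height):
    """Split (left, top, right, bottom) into equal horizontal strips.

    Each strip is roughly `line_height` tall, standing in for the
    per-line boxes that are not available from the layout output.
    """
    left, top, right, bottom = box
    height = bottom - top
    if height <= 0 or line_height <= 0:
        raise ValueError("box height and line_height must be positive")
    # Use at least one strip and never leave a sliver at the bottom.
    count = max(1, math.ceil(height / line_height))
    strip = height / count
    return [
        (left, top + i * strip, right, top + (i + 1) * strip)
        for i in range(count)
    ]


# Hypothetical text block, 90 units tall, with an assumed line height of 30.
for line_box in split_into_line_boxes((100, 200, 500, 290), line_height=30):
    print(line_box)
# (100, 200.0, 500, 230.0)
# (100, 230.0, 500, 260.0)
# (100, 260.0, 500, 290.0)
```

The resulting strips would then be passed to the ordering model, and the order of the original layout boxes recovered from the position of their first strip.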

Bariskau avatar Jan 04 '25 13:01 Bariskau

@Bariskau @mkhalid12 a revised reading order model is currently under development. We will post updates when we have them ready.

cau-git avatar Jan 31 '25 11:01 cau-git

@Bariskau @mkhalid12 You can track this PR: https://github.com/DS4SD/docling/pull/811

PeterStaar-IBM avatar Jan 31 '25 11:01 PeterStaar-IBM

@cau-git this is great news. Looking forward to it.

mkhalid12 avatar Jan 31 '25 11:01 mkhalid12