Missing text while parsing a PDF

Open pyRis opened this issue 9 months ago • 1 comments

Bug

In some documents, docling omits some text from the generated Markdown, and omits a different subset of text when converting to a txt file.

In the attached Markdown and text files, please have a look at sections 13 and 14 of the text.

Steps to reproduce

For MD: docling InmodeLtd_20190729_F-1A_EX-10.9_11743243_EX-10.9_Manufacturing\ Agreement.pdf

For txt: docling InmodeLtd_20190729_F-1A_EX-10.9_11743243_EX-10.9_Manufacturing\ Agreement.pdf --to text

Docling version

Docling version: 2.26.0
Docling Core version: 2.22.0
Docling IBM Models version: 3.4.1
Docling Parse version: 3.4.0
Python: cpython-39 (3.9.20)
Platform: Linux-6.13.7-arch1-1-x86_64-with-glibc2.41

Python version

Python 3.9.20

pdftotext (poppler) version

pdftotext version 25.03.0

Files related to the bug

InmodeLtd_20190729_F-1A_EX-10.9_11743243_EX-10.9_Manufacturing Agreement.md

InmodeLtd_20190729_F-1A_EX-10.9_11743243_EX-10.9_Manufacturing Agreement.txt

pdftotext_InmodeLtd_20190729_F-1A_EX-10.9_11743243_EX-10.9_Manufacturing Agreement.txt

InmodeLtd_20190729_F-1A_EX-10.9_11743243_EX-10.9_Manufacturing Agreement.pdf

Mar 22 '25 11:03 pyRis

@pyRis The main problem I can see is that the part between sections 12.6 and 13.3.5 has been incorrectly detected as a table. Hence, the Markdown output contains that text as a table, while the plaintext version skips tables entirely and leaves only a placeholder in their place.

Section 14.1 is missing in both outputs. We will keep this sample on record for future testing as we improve the layout detection model.
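
One way a sample like this could serve as a regression check is to compare a converter's output against a reference extraction (for example, the attached pdftotext output) and list the reference lines that never show up. The following is only a minimal sketch under that assumption: the function names `normalize` and `missing_lines` and the `min_len` threshold are hypothetical, it uses only the Python standard library, and it does not call docling itself.

```python
import re


def normalize(text: str) -> str:
    """Collapse all whitespace runs to single spaces and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def missing_lines(reference: str, candidate: str, min_len: int = 20) -> list:
    """Return reference lines whose normalized text is absent from the candidate.

    Hypothetical helper: `reference` would be a trusted extraction (e.g. the
    pdftotext output) and `candidate` the Markdown or text produced by the
    converter. Lines shorter than `min_len` characters after normalization are
    skipped, since page numbers and short labels produce noisy matches.
    """
    # Normalize the whole candidate at once so that a sentence wrapped
    # differently across lines still matches.
    candidate_blob = normalize(candidate)
    missing = []
    for line in reference.splitlines():
        norm = normalize(line)
        if len(norm) < min_len:
            continue
        if norm not in candidate_blob:
            missing.append(line.strip())
    return missing
```

Calling `missing_lines(reference_text, converter_output)` on the two files' contents would return the candidate gaps for manual review. Exact substring matching is deliberately strict: text that the converter rendered inside a Markdown table, or reference lines that break mid-sentence, may be reported as missing even though the words are present, so the result is a list of places to inspect rather than a definitive verdict.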

May 21 '25 12:05 cau-git