
Missing text while parsing a PDF

Open pyRis opened this issue 9 months ago • 1 comments

Bug

For some documents, docling omits part of the text from the generated Markdown, and omits a different subset of the text when converting to a txt file.

In the attached Markdown and text files, please have a look at sections 13 and 14 of the text.

Steps to reproduce

For MD: docling InmodeLtd_20190729_F-1A_EX-10.9_11743243_EX-10.9_Manufacturing\ Agreement.pdf

For txt: docling InmodeLtd_20190729_F-1A_EX-10.9_11743243_EX-10.9_Manufacturing\ Agreement.pdf --to text

Docling version

Docling version: 2.26.0
Docling Core version: 2.22.0
Docling IBM Models version: 3.4.1
Docling Parse version: 3.4.0
Python: cpython-39 (3.9.20)
Platform: Linux-6.13.7-arch1-1-x86_64-with-glibc2.41

Python version

Python 3.9.20

pdftotext (poppler) version

pdftotext version 25.03.0

Files related to the bug

InmodeLtd_20190729_F-1A_EX-10.9_11743243_EX-10.9_Manufacturing Agreement.md

InmodeLtd_20190729_F-1A_EX-10.9_11743243_EX-10.9_Manufacturing Agreement.txt

pdftotext_InmodeLtd_20190729_F-1A_EX-10.9_11743243_EX-10.9_Manufacturing Agreement.txt

InmodeLtd_20190729_F-1A_EX-10.9_11743243_EX-10.9_Manufacturing Agreement.pdf

pyRis avatar Mar 22 '25 11:03 pyRis

@pyRis The main problem I can see is that the part between sections 12.6 and 13.3.5 has been incorrectly detected as a table. As a result, the Markdown output contains that text as a table, while the plaintext version skips tables entirely (adding <!-- missing-text --> as a placeholder).

Section 14.1 is missing in both outputs. We will keep this sample on record for future testing as we improve the layout detection model.
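To narrow down which passages are affected in a case like this, one option is to compare the docling text output against the pdftotext output. The following is only a minimal sketch using the Python standard library; the helper names `normalize`, `missing_lines`, and `count_placeholders` are hypothetical and not part of docling. It assumes both outputs have already been read into strings, and that a reference line counts as present if it appears anywhere in the whitespace-normalized docling output.

```python
import re

# Placeholder that the plaintext export writes where a table was skipped
# (taken from the comment above).
PLACEHOLDER = "<!-- missing-text -->"


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so layout differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def missing_lines(reference_text: str, candidate_text: str, min_length: int = 20) -> list:
    """Return reference lines that do not occur in the candidate text.

    Lines shorter than min_length (page numbers, short headings) are skipped
    because they match or mismatch too easily to be informative.
    """
    candidate_blob = normalize(candidate_text)
    missing = []
    for raw in reference_text.splitlines():
        line = normalize(raw)
        if len(line) >= min_length and line not in candidate_blob:
            missing.append(raw.strip())
    return missing


def count_placeholders(candidate_text: str) -> int:
    """Count how many skipped-table placeholders appear in the candidate text."""
    return candidate_text.count(PLACEHOLDER)
```

Here `reference_text` would be the contents of the pdftotext file and `candidate_text` the contents of the docling txt file. Because pdftotext may wrap lines differently from docling, a reported line is only a hint that should be checked against the PDF, not proof that the text was dropped.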

cau-git avatar May 21 '25 12:05 cau-git