Missing text while parsing a PDF
Bug
For some documents, docling omits some text from the generated markdown, and omits a different subset of the text when converting to a txt file.
In the attached markdown and text files, please have a look at sections 13 and 14 of the text.
Steps to reproduce
For MD: docling InmodeLtd_20190729_F-1A_EX-10.9_11743243_EX-10.9_Manufacturing\ Agreement.pdf
For txt: docling InmodeLtd_20190729_F-1A_EX-10.9_11743243_EX-10.9_Manufacturing\ Agreement.pdf --to text
Docling version
Docling version: 2.26.0
Docling Core version: 2.22.0
Docling IBM Models version: 3.4.1
Docling Parse version: 3.4.0
Python: cpython-39 (3.9.20)
Platform: Linux-6.13.7-arch1-1-x86_64-with-glibc2.41
Python version
Python 3.9.20
pdftotext (poppler) version
pdftotext version 25.03.0
Files related to the bug
InmodeLtd_20190729_F-1A_EX-10.9_11743243_EX-10.9_Manufacturing Agreement.md
InmodeLtd_20190729_F-1A_EX-10.9_11743243_EX-10.9_Manufacturing Agreement.txt
pdftotext_InmodeLtd_20190729_F-1A_EX-10.9_11743243_EX-10.9_Manufacturing Agreement.txt
InmodeLtd_20190729_F-1A_EX-10.9_11743243_EX-10.9_Manufacturing Agreement.pdf
@pyRis The main problem I can see is that the part between sections 12.6 and 13.3.5 has been incorrectly detected as a table. As a result, the markdown output contains that text as a table, while the plaintext export skips tables entirely (adding <!-- missing-text --> as a placeholder).
Section 14.1 is missing from both outputs. We will keep this sample on record for future testing as we improve the layout detection model.
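For anyone who wants to pin down which passages are affected, a rough way is to treat the attached pdftotext output as a reference and check which of its lines do not appear in the docling output. The sketch below uses only the Python standard library; the helper names (`normalize`, `count_placeholders`, `missing_lines`) and the `min_len` threshold are made up for illustration and are not part of docling. It assumes the placeholder string is exactly the `<!-- missing-text -->` mentioned above.

```python
import re

# Placeholder the plaintext export is reported to write where content was skipped.
PLACEHOLDER = "<!-- missing-text -->"


def normalize(text: str) -> str:
    """Collapse all whitespace and lowercase, so line wrapping does not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def count_placeholders(docling_text: str) -> int:
    """Count how many times the skipped-content placeholder occurs."""
    return docling_text.count(PLACEHOLDER)


def missing_lines(reference_text: str, docling_text: str, min_len: int = 20) -> list[str]:
    """Return reference lines whose normalized text is absent from the docling output.

    Lines shorter than min_len (hypothetical threshold) are ignored, since short
    fragments such as page numbers produce noisy matches.
    """
    haystack = normalize(docling_text)
    missing = []
    for raw in reference_text.splitlines():
        needle = normalize(raw)
        if len(needle) >= min_len and needle not in haystack:
            missing.append(raw.strip())
    return missing
```

To use it, read the pdftotext file and the docling .txt (or .md) file into strings and pass them as `reference_text` and `docling_text`. This is a plain substring comparison, so it will over-report wherever the two tools differ in hyphenation or reading order; the result is a list of candidates to inspect, not a definitive diff.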