Apparently simple pdf file totally destroyed by docling

Open caa24 opened this issue 7 months ago • 0 comments

Bug

The Markdown output from the conversion has serious problems: it contains large blank areas, and in places the text appears in a different order than in the PDF. I can't work out why this happens. The PDF may be unusual internally. What puzzles me most, though, is that the same thing happens with the VLM pipeline using granite3.2-vision:2b. I expected that pipeline to ignore any internal quirks of the file entirely, since, as I understand it, each page is treated as an image. ...

Steps to reproduce

Run the standard pipeline (or even the VLM pipeline) on this public PDF file: https://database.ich.org/sites/default/files/ICH_Q14_Document_Step2_Guideline_2022_0324.pdf ...
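For anyone trying to reproduce this, a minimal sketch of the standard-pipeline run might look like the following. It uses docling's default `DocumentConverter` on the URL from this report. The `longest_blank_run` helper is hypothetical (not part of docling) and assumes the blank areas show up as runs of empty lines in the exported Markdown, which may not match how they actually appear.

```python
# URL taken from the report above.
PDF_URL = (
    "https://database.ich.org/sites/default/files/"
    "ICH_Q14_Document_Step2_Guideline_2022_0324.pdf"
)


def convert_to_markdown(source: str) -> str:
    """Convert a PDF (local path or URL) to Markdown with docling's default pipeline."""
    # Imported lazily so the helper below can be used without docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    return result.document.export_to_markdown()


def longest_blank_run(markdown: str) -> int:
    """Hypothetical helper: length of the longest run of consecutive blank lines."""
    longest = current = 0
    for line in markdown.splitlines():
        if line.strip():
            current = 0  # a non-blank line ends the current run
        else:
            current += 1
            longest = max(longest, current)
    return longest


if __name__ == "__main__":
    md = convert_to_markdown(PDF_URL)
    print(f"{len(md)} characters, longest blank run: {longest_blank_run(md)} lines")
```

Running the script downloads the PDF and converts it; comparing the blank-run figure and the text order against the original pages gives a rough way to quantify what is described above.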

Docling version

Docling version: 2.32.0
Docling Core version: 2.31.0
Docling IBM Models version: 3.4.3
Docling Parse version: 4.0.1
Python: cpython-310 (3.10.6)
Platform: Windows-10-10.0.26100-SP0
...
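To compare environments when reproducing, the same kind of information can be gathered with the standard library. This is only a sketch: the PyPI distribution names in `PACKAGES` are assumed to correspond to the components listed above, and the output format will not match the listing exactly.

```python
import platform
from importlib.metadata import PackageNotFoundError, version

# Assumed PyPI distribution names for the components listed above.
PACKAGES = ("docling", "docling-core", "docling-ibm-models", "docling-parse")


def collect_versions() -> dict:
    """Return installed versions, with None for packages that are not installed."""
    found = {}
    for name in PACKAGES:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in collect_versions().items():
        print(f"{name}: {ver or 'not installed'}")
    print(f"Python: {platform.python_version()}")
    print(f"Platform: {platform.platform()}")
```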

Python version

3.10.6

caa24, May 23 '25 19:05