Ignoring Images When Converting from PDF to MD

Open sallahbaksh opened this issue 11 months ago • 4 comments

Question

Is there a way to ignore images when converting from PDF to Markdown? If a PDF contains many images, the conversion process becomes very slow, sometimes taking over an hour. Any guidance on optimizing this or skipping images would be greatly appreciated.

sallahbaksh avatar Jan 23 '25 18:01 sallahbaksh

Can you give us an example?

PeterStaar-IBM avatar Jan 26 '25 07:01 PeterStaar-IBM

I've attached a PDF that takes over an hour to convert to Markdown: Whitestown-UDO-Adopted-2020-06-12_Amended-November-2023 1.pdf

sallahbaksh avatar Jan 27 '25 16:01 sallahbaksh

@sallahbaksh Thanks a lot, let me do some investigation. At first glance, it looks like the model gets confused by the page furniture (left and right) and starts to interpret everything as a table, which makes it slow.

I think that with this example, we can make the layout model more robust. Let us work on that!

PeterStaar-IBM avatar Jan 28 '25 06:01 PeterStaar-IBM

@sallahbaksh I tried converting the sample document, and it does indeed take a long time. However, it comes out fine in the end. I suspect the slow-down comes from the high frequency of tables: table-structure inference is expensive, so longer runtimes are to be expected. You can check whether you get faster results by disabling table-structure recognition. The included images should have no effect on the speed.
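A minimal sketch of what disabling table-structure recognition might look like, assuming docling's Python API with `PdfPipelineOptions` and its `do_table_structure` flag. The helper name `convert_without_tables` is hypothetical, and option names can change between releases, so check them against the installed version.

```python
def convert_without_tables(pdf_path: str) -> str:
    """Convert a PDF to Markdown with table-structure inference turned off."""
    # Imported lazily so this module still loads where docling is not installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    # Skip the table-structure model, the step suspected of causing the slow-down.
    options.do_table_structure = False

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=options),
        }
    )
    result = converter.convert(pdf_path)
    return result.document.export_to_markdown()
```

Timing this against a default `DocumentConverter()` run on the same file would show whether tables, rather than images, account for the runtime.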

cau-git avatar Jan 31 '25 11:01 cau-git

Is there an option to ignore images when converting documents to MD? We're converting documents that contain logos, and we'd like to leave the images out of the Markdown output.

kany avatar Oct 23 '25 12:10 kany
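The thread does not settle whether a dedicated option exists. One possible workaround, sketched here, is to post-process the exported Markdown and drop image markup. The helper `strip_images` is hypothetical and not part of docling, and the `<!-- image -->` placeholder form is an assumption about what the exporter emits.

```python
import re

# Standard Markdown image syntax: ![alt text](target "optional title")
_MD_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
# Assumed placeholder comment left where an image was; adjust if the output differs.
_IMAGE_PLACEHOLDER = re.compile(r"<!--\s*image\s*-->")


def strip_images(markdown: str) -> str:
    """Remove image references and image placeholders from Markdown text."""
    text = _MD_IMAGE.sub("", markdown)
    text = _IMAGE_PLACEHOLDER.sub("", text)
    # Collapse the runs of blank lines left behind by removed images.
    return re.sub(r"\n{3,}", "\n\n", text)


if __name__ == "__main__":
    sample = "# Title\n\n![logo](logo.png)\n\nBody text\n"
    print(strip_images(sample))
```

This only cleans the output; it does not make the conversion itself any faster.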