docling: Ignoring Images When Converting from PDF to MD

Question

Is there a way to ignore images when converting from PDF to Markdown? If a PDF contains many images, the conversion process becomes very slow, sometimes taking over an hour. Any guidance on optimizing this or skipping images would be greatly appreciated.

Jan 23 '25 18:01 sallahbaksh

Can you give us an example?

Jan 26 '25 07:01 PeterStaar-IBM

I've attached a PDF that takes over an hour to convert to Markdown: Whitestown-UDO-Adopted-2020-06-12_Amended-November-2023 1.pdf

Jan 27 '25 16:01 sallahbaksh

@sallahbaksh Thanks a lot, let me do some investigation. At first glance, it looks like the model gets confused by the page furniture (left and right) and starts to interpret everything as a table, which makes it slow.

I think we can use this example to make the layout model more robust. Let us work on that!

Jan 28 '25 06:01 PeterStaar-IBM

@sallahbaksh I tried converting the sample document, and it does indeed take a long time. However, it comes out fine in the end. I suspect the slow-down is caused by the high frequency of tables: table-structure inference is expensive, so longer runtimes are to be expected. You can check whether you get fast results by disabling table-structure inference. The included images should have no effect on the speed.
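A minimal sketch of that check, assuming docling's `PdfPipelineOptions` exposes the `do_table_structure` flag and that the converter accepts it through `PdfFormatOption`. The helper names `build_converter` and `pdf_to_markdown` are hypothetical and only for illustration:

```python
from pathlib import Path


def build_converter(do_table_structure: bool = False):
    """Return a docling DocumentConverter with table-structure inference toggled."""
    # Imported lazily so this module can be loaded even where docling is absent.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    # Assumption: this flag switches the table-structure model on or off.
    pipeline_options.do_table_structure = do_table_structure
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


def pdf_to_markdown(pdf_path: str, do_table_structure: bool = False) -> str:
    """Convert one PDF to Markdown, by default without table-structure inference."""
    converter = build_converter(do_table_structure)
    result = converter.convert(Path(pdf_path))
    return result.document.export_to_markdown()
```

Timing `pdf_to_markdown(path, do_table_structure=False)` against `pdf_to_markdown(path, do_table_structure=True)` on the same file should show whether tables are the bottleneck.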

Jan 31 '25 11:01 cau-git

Is there an option to ignore images when converting documents to MD? Our documents contain logos, and we would like to leave images out of the Markdown output.

Oct 23 '25 12:10 kany
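
For the logo case, one library-independent workaround is to strip image references from the Markdown after conversion. The sketch below is only an illustration: it assumes images show up either as standard Markdown image links or as an `<!-- image -->` placeholder comment, and `strip_images` is a hypothetical helper, not part of docling:

```python
import re

# Standard Markdown image syntax: ![alt text](target)
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\([^)]*\)")
# Assumed placeholder comment left where an image was not exported.
IMAGE_PLACEHOLDER = re.compile(r"<!--\s*image\s*-->")


def strip_images(markdown: str) -> str:
    """Remove image links and image placeholders from a Markdown string."""
    text = IMAGE_LINK.sub("", markdown)
    text = IMAGE_PLACEHOLDER.sub("", text)
    # Collapse the blank-line runs left behind by the removed images.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"
```

This only cleans the output; it does not change how long the conversion itself takes.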