Ignoring Images When Converting from PDF to MD
Question
Is there a way to ignore images when converting from PDF to Markdown? If a PDF contains many images, the conversion process becomes very slow, sometimes taking over an hour. Any guidance on optimizing this or skipping images would be greatly appreciated.
Can you give us an example?
I've attached a PDF that takes over an hour to convert to Markdown: Whitestown-UDO-Adopted-2020-06-12_Amended-November-2023 1.pdf
@sallahbaksh Thanks a lot. Let me investigate, but at first glance it looks like the model gets confused by the page furniture (on the left and right) and starts interpreting everything as a table, which makes it slow.
I think we can use this example to make the layout model more robust. Let us work on that!
@sallahbaksh I tried converting the sample document, and it does take a long time, although the output comes out fine in the end. I suspect the slow-down is caused by the high frequency of tables: table-structure inference is expensive, so longer runtimes are to be expected. You can verify this by disabling table processing and checking whether the conversion becomes fast. The included images should have no effect on the speed.
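To check the suggestion above, it may help to time the same document with tables enabled and disabled. The following is only a minimal sketch: `time_call` and `compare_runs` are hypothetical helper names, and the converter callables are placeholders that you would replace with your own conversion functions (one configured with table processing on, one with it off).

```python
import time
from typing import Callable, Dict


def time_call(convert: Callable[[str], str], path: str) -> float:
    """Run one conversion and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    convert(path)
    return time.perf_counter() - start


def compare_runs(converters: Dict[str, Callable[[str], str]], path: str) -> Dict[str, float]:
    """Time each named converter on the same input file."""
    return {name: time_call(fn, path) for name, fn in converters.items()}
```

If the run with tables disabled finishes much faster, that supports the idea that table-structure inference, not the images, dominates the runtime.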
Is there an option to ignore images when converting documents to MD? We're converting documents that contain logos, and we'd like to leave the images out of the Markdown output.
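Until a built-in option is confirmed, one possible workaround is to strip images from the Markdown after conversion. This is a rough sketch, not a feature of the converter: `strip_images` is a hypothetical helper, and it assumes images appear either as standard `![alt](target)` Markdown or as an `<!-- image -->` placeholder comment.

```python
import re

# Standard inline Markdown image: ![alt text](target)
IMAGE_INLINE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
# Assumed placeholder comment left where an image was detected
IMAGE_PLACEHOLDER = re.compile(r"<!--\s*image\s*-->")


def strip_images(markdown: str) -> str:
    """Remove image references and tidy the blank lines they leave behind."""
    text = IMAGE_INLINE.sub("", markdown)
    text = IMAGE_PLACEHOLDER.sub("", text)
    # Collapse runs of three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"
```

Note that this only cleans up the output; it does not make the conversion itself any faster, and image targets that contain parentheses are not handled.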