
Create advanced PDF convertor

Open gagb opened this issue 11 months ago • 3 comments

@gagb Would be great to have this as an example in the README! Thanks.

Agreed. IMO, a PDF-based example would be best, where the text is converted in the normal way but any images in the PDF are sent to the LLM. I think this is a more compelling example to organizations than just PNGs.

Exactly this. We could use something like PyMuPDF4LLM to extract images and classify them with a VLM later, but if this could be done in one step during the pdf->md conversion, that would be brilliant!

Originally posted by @vaclcer in #12

Either extend the existing one or create a new one. This is a frequent use case.

gagb avatar Dec 18 '24 19:12 gagb

I have an open source project in which I implemented PDF vector drawing reconstruction for ML-based analysis. I would be willing to contribute it.

It uses a depth-first search (DFS) to combine nearby vector drawing bounding boxes into the maximally connected bbox. This bounding box can then be used to create a pymupdf pixmap and extract what is essentially a screenshot of the vector art.

Tables in PDFs are usually a collection of distinct vector drawing lines and rectangles, not actual images.

See: https://github.com/FEPrep/parser/blob/main/parser/drawing/drawing.py
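The merging step described above can be sketched roughly as follows. This is a minimal illustration and not the code from the linked repository: the names `is_near` and `merge_nearby_boxes`, the `gap` threshold, and the plain `(x0, y0, x1, y1)` tuple format are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical bbox format for this sketch: (x0, y0, x1, y1)
BBox = Tuple[float, float, float, float]


def is_near(a: BBox, b: BBox, gap: float) -> bool:
    """True when two boxes overlap or lie within `gap` units on both axes."""
    return (
        a[0] - gap <= b[2] and b[0] - gap <= a[2]
        and a[1] - gap <= b[3] and b[1] - gap <= a[3]
    )


def merge_nearby_boxes(boxes: List[BBox], gap: float = 5.0) -> List[BBox]:
    """Group boxes into connected components with an iterative DFS and
    return one enclosing bbox per component."""
    visited = [False] * len(boxes)
    merged: List[BBox] = []
    for start in range(len(boxes)):
        if visited[start]:
            continue
        # Depth-first search over the "is near" relation
        stack = [start]
        visited[start] = True
        component: List[BBox] = []
        while stack:
            i = stack.pop()
            component.append(boxes[i])
            for j in range(len(boxes)):
                if not visited[j] and is_near(boxes[i], boxes[j], gap):
                    visited[j] = True
                    stack.append(j)
        # The enclosing bbox of everything reachable from `start`
        merged.append((
            min(b[0] for b in component),
            min(b[1] for b in component),
            max(b[2] for b in component),
            max(b[3] for b in component),
        ))
    return merged
```

With PyMuPDF, the input boxes would presumably come from the `rect` entries returned by `page.get_drawings()`, and each merged box could be passed as the `clip` argument of `page.get_pixmap()` to render the region; the sketch leaves that part out so it runs without any dependencies.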

LockedThread avatar Dec 18 '24 20:12 LockedThread

I think PyMuPDF4LLM is a good option, but I believe there might be licensing issues. Currently, markitdown is under the MIT license, so I think it’s better to avoid integrating libraries with more restrictive licenses than MIT.

Similarly, libraries that need high-spec machines may be of limited use, since there will be environments where they cannot run effectively. I think that extracting information from PDFs, like extracting images or audio, would be better handled using APIs provided by LLMs, such as those from OpenAI.

For example, as follows: Document-Intelligence-with-LLM

What do you think?

HawkClaws avatar Dec 20 '24 05:12 HawkClaws

We have created a Markdown conversion library that aims to recover document structure with good accuracy without using an LLM! What do you think? It is also MIT licensed!

https://github.com/HawkClaws/pdf2markdown4llm

HawkClaws avatar Jan 12 '25 07:01 HawkClaws

As far as I can see, all of these solutions use an LLM for the whole page; to speed things up, it would make sense to use it only for images. Some PDFs contain tons of images and others none, so always using an LLM isn't a good option.
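A rough sketch of that idea, assuming the page has already been split into text and image blocks by some extractor. The block dictionaries, the `blocks_to_markdown` name, and the `describe_image` callback are hypothetical and are not part of markitdown's API; the callback stands in for whatever LLM or VLM client is used.

```python
from typing import Callable, Dict, List


def blocks_to_markdown(
    blocks: List[Dict],
    describe_image: Callable[[bytes], str],
) -> str:
    """Convert pre-extracted page blocks to Markdown.

    Text blocks are passed through unchanged; only image blocks are sent
    to `describe_image`, so a page without images makes no LLM calls.
    """
    parts: List[str] = []
    for block in blocks:
        if block["type"] == "text":
            parts.append(block["text"].strip())
        elif block["type"] == "image":
            # The only place where the (assumed) LLM callback is invoked
            caption = describe_image(block["data"])
            parts.append(f"![{caption}]({block['name']})")
    return "\n\n".join(p for p in parts if p)
```

Under this design the cost scales with the number of images rather than the number of pages.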

Mte90 avatar Jun 17 '25 17:06 Mte90