LLM Integration
As a feature suggestion, I would love it if this could allow you to plug in an LLM (such as GPT-4o) so that images included in the content could be sent to the LLM for further understanding. For example, if it were, say, a bar chart or graph, it would be sent to the LLM to be converted into a Markdown table and integrated into the final Markdown output.
I could see it returning markers for where that content should appear, along with the source content, in a way we could handle as we please. E.g.: send to an LLM, upload to a server and link, etc.
Maybe as an MCP server? That wouldn't be too hard.
Relevant: https://github.com/microsoft/markitdown/blob/main/src/markitdown/_markitdown.py#L755 I've tested an older version of it, but not the new version.
> As a feature suggestion, I would love it if this could allow you to plug in an LLM (such as GPT-4o) so that images included in the content could be sent to the LLM for further understanding. For example, if it were, say, a bar chart or graph, it would be sent to the LLM to be converted into a Markdown table and integrated into the final Markdown output.
It's already supported; here you go. You need an OpenAI key.
```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(mlm_client=client, mlm_model="gpt-4o")

image_file = 'https://gagb.github.io/imgs/bansal-chi21.png'
result = md.convert(source=image_file)
print(result.text_content)
```
Output:

```text
ImageSize: 388x368

# Description:
This image illustrates a comparison of decision accuracy with and without explanations. On the left, there's a computer icon connected to a user symbol, suggesting human-computer interaction. The vertical axis represents the "Accuracy of decisions," ranging from 0.0 to 1.0.

Two dotted lines extend horizontally: one labeled "R only" and another higher up labeled "R+Explanation." A red arrow points upward from "R only" to "R+Explanation," indicating an increase in accuracy with the addition of explanations. The red delta symbol (∆) signifies the improvement, which is greater than zero. A blue question mark next to it suggests uncertainty or inquiry related to this improvement.

At the bottom, the label "B Ours" signifies that this concept is part of the author's work or findings. Overall, the diagram emphasizes the value of explanations in enhancing decision-making accuracy in human-computer interactions.
```
@gagb Would be great to have this as an example in the README! Thanks.
> @gagb Would be great to have this as an example in the README! Thanks.
Agreed. IMO, a PDF-based example would be best, where it handles the text in the normal way but any images in the PDF are sent to the LLM. I think this is a more compelling example to organizations than just PNGs.
> It's already supported; here you go. You need an OpenAI key.
Hi! Maybe I missed it and it's already implemented, but that looks more like a description of the image rather than, for example, text extraction. Is there any way to customize the prompt?
> Is there any way to customize the prompt?
You can customize the prompt; just pass `mlm_prompt` to the `convert` method.
Like this:
```python
# ... construct the MarkItDown instance (here called `markitdown`) with your mlm_client/mlm_model as above
prompt = "extract the objects in the picture, and only output keywords, separated by commas."
result = markitdown.convert("path/to/example.jpg", mlm_prompt=prompt)
```
But the latest version cannot apply the param properly when converting a URL file, so I just created a PR to fix it: #48
> Agreed. IMO, a PDF-based example would be best, where it handles the text in the normal way but any images in the PDF are sent to the LLM.
Exactly this. We could use something like PyMuPDF4LLM to extract images and classify them with a VLM later, but if this could be done in one step during PDF→MD conversion, that would be brilliant!
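A minimal sketch of that two-step variant, assuming pymupdf4llm's `to_markdown` with its `write_images`/`image_path` options (file names here are placeholders):

```python
import pymupdf4llm

# Step 1: PDF -> Markdown, with embedded images written to disk and
# referenced from the Markdown instead of being dropped.
md_text = pymupdf4llm.to_markdown(
    "report.pdf",                  # placeholder input
    write_images=True,
    image_path="extracted_images",
)

# Step 2 (separate pass, not shown): send each file in extracted_images/
# to a VLM and splice its description back into md_text where the
# corresponding image reference appears.
print(md_text[:500])
```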
> Agreed. IMO, a PDF-based example would be best, where it handles the text in the normal way but any images in the PDF are sent to the LLM. I think this is a more compelling example to organizations than just PNGs.
Completely agree. Basically, in pseudocode:

- take in a PDF
- go through it and, for each image:
  - send the image to an LLM for semantic understanding/text parsing (task TBD by the user)
  - replace the LLM output in the markdown (inline)

This way the tool would be really useful, and the last issue to be resolved would be proper table parsing (from PDF). A rough sketch of this loop follows below.
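For concreteness, here is a minimal sketch of that loop outside MarkItDown, assuming PyMuPDF (`fitz`) for extraction and the OpenAI vision API for the per-image step; the file name and prompt are placeholders, and descriptions are appended after each page's text rather than spliced in at the exact inline position:

```python
import base64

import fitz  # PyMuPDF
from openai import OpenAI

client = OpenAI()
prompt = "Describe this figure; if it is a chart, reproduce it as a Markdown table."

doc = fitz.open("example.pdf")  # placeholder path
markdown_parts = []

for page in doc:
    # 1. Normal text extraction for the page.
    markdown_parts.append(page.get_text("text"))

    # 2. For each embedded image, ask the LLM for a semantic description.
    for img in page.get_images(full=True):
        pix = fitz.Pixmap(doc, img[0])  # img[0] is the image xref
        if pix.n - pix.alpha > 3:       # convert CMYK and friends to RGB
            pix = fitz.Pixmap(fitz.csRGB, pix)
        b64 = base64.b64encode(pix.tobytes("png")).decode()

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        # 3. Splice the LLM output into the Markdown (here: after the page text).
        markdown_parts.append(response.choices[0].message.content)

print("\n\n".join(markdown_parts))
```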
@tinosai Let's move the PDF-specific ideas to #131.
In general, I would love the ability to use hooks when converting any document containing images. Every time the converter finds an image, call me back with the `byte[]` so that I can handle the conversion of the image to text myself. This could also happen with a local model and not just GPT.
The point is replacing the images with their descriptions in any type of document, and that replacement implies keeping the text/image parts in sequence.
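Purely as an illustration of what such a hook could look like (nothing like this exists in MarkItDown today; the `image_handler` name is made up):

```python
from typing import Callable

# (image_bytes, mime_type) -> replacement Markdown text
ImageHandler = Callable[[bytes, str], str]

def describe_image_locally(image_bytes: bytes, mime_type: str) -> str:
    """Stand-in handler: plug a local vision model, an upload-and-link step,
    or any other logic in here."""
    return f"[image: {mime_type}, {len(image_bytes)} bytes - description would go here]"

# Hypothetical wiring -- MarkItDown does not accept such a parameter today:
# md = MarkItDown(image_handler=describe_image_locally)
```

The converter would call the handler at the position of every image it finds and keep the returned text in sequence with the surrounding text.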
If I could make one additional suggestion to this excellent pseudocode, it would be to include the ability to get the page number. Very likely this would be possible with the pseudocode, but I point it out because of the importance of this from a citation perspective. When doing RAG, you want to allow a user to go directly in the document to where the answer was generated. Since markdown does not have a concept of page numbers, you need to be able to let them go to something like the original PDF, and having the page number makes that a lot more efficient.
> If I could make one additional suggestion to this excellent pseudocode, it would be to include the ability to get the page number.
@liamca You should never need that. You can store the page and even more details in the payload that you then associate with the embedding.
Embeddings should never span more than a page, and in many cases even that amount of text is too much to get decent results.
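To illustrate the point (a generic sketch, not tied to any particular vector store; names and values are placeholders):

```python
from typing import Dict, List

def embed(text: str) -> List[float]:
    """Stand-in for whatever embedding model you actually use."""
    return [0.0] * 8

description = "LLM-generated description of the figure on this page..."

chunk: Dict[str, object] = {
    "text": description,
    "embedding": embed(description),
    "payload": {
        "source": "report.pdf",       # placeholder file name
        "page": 12,                   # placeholder; lets the UI deep-link into the original PDF
        "kind": "image_description",  # vs. "body_text"
    },
}
```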
I have sketched some of the code that takes care of passing the image to the LLM from the original PDF. However, the main problem is that in many circumstances pdfminer is unable to properly parse images in the document.
We need to replace pdfminer IMHO.
Perfect case for why we need this feature. As an example, let's look at this slide from a PDF. This MD extraction is nowhere near a representation of what the slide conveys.
### MarkItDown's Output
```text
Crompton 2.0 continues to deliver results: Sustaining growth rates across ECD and Lighting
segments; Strong margin expansion in ECD; Profitability remains intact even with higher A&P
spends in Lighting
Q3 FY25 ECD revenue grew by 6% YoY;
EBIT rose 19% YoY to 196 Cr.
Q3 FY25 lighting revenue grew by 3% YoY to Rs. 257 Cr,
demonstrating continued momentum
Standalone ECD revenue (Rs. Cr)
Standalone lighting revenue (Rs. Cr)
14%
4,407
3,876
6%
1,288
1,209
4%
743
716
3%
257
249
9M FY24
9M FY25
Q3 FY24
Q3 FY25
9M FY24
9M FY25
Q3 FY24
Q3 FY25
ECD EBIT (Rs. Cr) & EBIT Margin
Lighting EBIT (Rs. Cr) & EBIT Margin
661
15.0%
27%
521
13.5%
19%
196
15.2%
164
13.6%
A&P
Spends
EBIT
5.5
80
11.2%
23.3
76
10.2%
3.2
28
8.0
28
11.2%
10.8%
9M FY24
9M FY25
Q3 FY24
Q3 FY25
Note: Standalone Financials
9M FY24
9M FY25
Q3 FY24
Q3 FY25
6
```