LLM Integration
As a feature suggestion, I would love it if this could allow you to plug in an LLM (such as GPT-4o) so that images included in the content could be sent to the LLM for further understanding. For example, if it were, say, a bar chart or graph, it would be sent to the LLM to be converted into a Markdown table and integrated into the final Markdown output.
I could see it returning markers for where that content should appear, along with the source content, in a way we could handle as we please. E.g.: send to an LLM, upload to a server and link, etc.
Maybe as an MCP server? That wouldn't be too hard.
Relevant: https://github.com/microsoft/markitdown/blob/main/src/markitdown/_markitdown.py#L755 I've tested an older version of it, but not the new version.
> As a feature suggestion, I would love it if this could allow you to plug in an LLM (such as GPT-4o) so that images included in the content could be sent to the LLM for further understanding. For example, if it were, say, a bar chart or graph, it would be sent to the LLM to be converted into a Markdown table and integrated into the final Markdown output.
It's already supported; here you go. You need an OpenAI key.
```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(mlm_client=client, mlm_model="gpt-4o")

image_file = 'https://gagb.github.io/imgs/bansal-chi21.png'
result = md.convert(source=image_file)
print(result.text_content)
```
Output:

```text
ImageSize: 388x368

# Description:
This image illustrates a comparison of decision accuracy with and without explanations. On the left, there's a computer icon connected to a user symbol, suggesting human-computer interaction. The vertical axis represents the "Accuracy of decisions," ranging from 0.0 to 1.0.

Two dotted lines extend horizontally: one labeled "R only" and another higher up labeled "R+Explanation." A red arrow points upward from "R only" to "R+Explanation," indicating an increase in accuracy with the addition of explanations. The red delta symbol (∆) signifies the improvement, which is greater than zero. A blue question mark next to it suggests uncertainty or inquiry related to this improvement.

At the bottom, the label "B Ours" signifies that this concept is part of the author's work or findings. Overall, the diagram emphasizes the value of explanations in enhancing decision-making accuracy in human-computer interactions.
```
@gagb Would be great to have this as an example in the README! Thanks.
> @gagb Would be great to have this as an example in the README! Thanks.
Agreed. IMO, a PDF-based example would be best, where it handles the text in the normal way but any images in the PDF are sent to the LLM. I think this is a more compelling example to organizations than just PNGs.
> It's already supported; here you go. You need an OpenAI key.
Hi! Maybe I missed it and it's already implemented, but that looks more like a description of the image rather than, for example, text extraction. Is there any way to customize the prompt?
> Is there any way to customize the prompt?
You can customize the prompt; just pass `mlm_prompt` to the `convert` method.
Like this:
```python
# ... construct the MarkItDown instance (here called `markitdown`) with your mlm_client/mlm_model as above
prompt = "extract the objects in the picture, and only output keywords, separated by commas."
result = markitdown.convert("path/to/example.jpg", mlm_prompt=prompt)
```
But the latest version cannot apply the param properly when converting a URL file, so I just created a PR to fix it: #48
> Agreed. IMO, a PDF-based example would be best, where it handles the text in the normal way but any images in the PDF are sent to the LLM.
Exactly this. We could use something like PyMuPDF4LLM to extract images and classify them with a VLM later, but if this could be done in one step during PDF→MD conversion, that would be brilliant!
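A minimal sketch of that two-step variant, assuming pymupdf4llm's `to_markdown` with its `write_images`/`image_path` options (file names here are placeholders):

```python
import pymupdf4llm

# Step 1: PDF -> Markdown, with embedded images written to disk and
# referenced from the Markdown instead of being dropped.
md_text = pymupdf4llm.to_markdown(
    "report.pdf",                  # placeholder input
    write_images=True,
    image_path="extracted_images",
)

# Step 2 (separate pass, not shown): send each file in extracted_images/
# to a VLM and splice its description back into md_text where the
# corresponding image reference appears.
print(md_text[:500])
```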
> Agreed. IMO, a PDF-based example would be best, where it handles the text in the normal way but any images in the PDF are sent to the LLM. I think this is a more compelling example to organizations than just PNGs.
Completely agree. Basically, in pseudocode:

- take in a PDF
- go through it and, for each image:
  - send the image to an LLM for semantic understanding/text parsing (task TBD by the user)
  - replace the LLM output in the markdown (inline)

This way the tool would be really useful, and the last issue to be resolved would be proper table parsing (from PDF). A rough sketch of this loop follows below.
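For concreteness, here is a minimal sketch of that loop outside MarkItDown, assuming PyMuPDF (`fitz`) for extraction and the OpenAI vision API for the per-image step; the file name and prompt are placeholders, and descriptions are appended after each page's text rather than spliced in at the exact inline position:

```python
import base64

import fitz  # PyMuPDF
from openai import OpenAI

client = OpenAI()
prompt = "Describe this figure; if it is a chart, reproduce it as a Markdown table."

doc = fitz.open("example.pdf")  # placeholder path
markdown_parts = []

for page in doc:
    # 1. Normal text extraction for the page.
    markdown_parts.append(page.get_text("text"))

    # 2. For each embedded image, ask the LLM for a semantic description.
    for img in page.get_images(full=True):
        pix = fitz.Pixmap(doc, img[0])  # img[0] is the image xref
        if pix.n - pix.alpha > 3:       # convert CMYK and friends to RGB
            pix = fitz.Pixmap(fitz.csRGB, pix)
        b64 = base64.b64encode(pix.tobytes("png")).decode()

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        # 3. Splice the LLM output into the Markdown (here: after the page text).
        markdown_parts.append(response.choices[0].message.content)

print("\n\n".join(markdown_parts))
```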
@tinosai Let's move the PDF-specific ideas to #131.
In general, I would love the ability to use hooks when converting any document containing images. Every time the converter finds an image, call me back with the `byte[]` so that I can handle the conversion of the image to text myself. This could also happen with a local model and not just GPT.
The point is replacing the images with their descriptions in any type of document, and that replacement implies keeping the text/image parts in sequence.
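Purely as an illustration of what such a hook could look like (nothing like this exists in MarkItDown today; the `image_handler` name is made up):

```python
from typing import Callable

# (image_bytes, mime_type) -> replacement Markdown text
ImageHandler = Callable[[bytes, str], str]

def describe_image_locally(image_bytes: bytes, mime_type: str) -> str:
    """Stand-in handler: plug a local vision model, an upload-and-link step,
    or any other logic in here."""
    return f"[image: {mime_type}, {len(image_bytes)} bytes - description would go here]"

# Hypothetical wiring -- MarkItDown does not accept such a parameter today:
# md = MarkItDown(image_handler=describe_image_locally)
```

The converter would call the handler at the position of every image it finds and keep the returned text in sequence with the surrounding text.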
If I could make one additional suggestion to this excellent pseudocode, it would be to include the ability to get the page number. Very likely this would be possible with the pseudocode, but I point it out because of the importance of this from a citation perspective. When doing RAG, you want to allow a user to go directly in the document to where the answer was generated. Since markdown does not have a concept of page numbers, you need to be able to let them go to something like the original PDF, and having the page number makes that a lot more efficient.
> If I could make one additional suggestion to this excellent pseudocode, it would be to include the ability to get the page number.
@liamca You should never need that. You can store the page and even more details in the payload that you then associate with the embedding.
Embeddings should never span more than a page, and in many cases even that amount of text is too much to get decent results.
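To illustrate the point (a generic sketch, not tied to any particular vector store; names and values are placeholders):

```python
from typing import Dict, List

def embed(text: str) -> List[float]:
    """Stand-in for whatever embedding model you actually use."""
    return [0.0] * 8

description = "LLM-generated description of the figure on this page..."

chunk: Dict[str, object] = {
    "text": description,
    "embedding": embed(description),
    "payload": {
        "source": "report.pdf",       # placeholder file name
        "page": 12,                   # placeholder; lets the UI deep-link into the original PDF
        "kind": "image_description",  # vs. "body_text"
    },
}
```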
I have sketched some of the code that takes care of passing the image to the LLM from the original PDF. However, the main problem is that in many circumstances pdfminer is unable to properly parse images in the document.
We need to replace pdfminer IMHO.
Perfect case for why we need this feature. As an example, let's look at this slide from a PDF. This MD extraction is nowhere near a representation of what the slide conveys.
### MarkItDown's Output
```text
Crompton 2.0 continues to deliver results: Sustaining growth rates across ECD and Lighting
segments; Strong margin expansion in ECD; Profitability remains intact even with higher A&P
spends in Lighting
Q3 FY25 ECD revenue grew by 6% YoY;
EBIT rose 19% YoY to 196 Cr.
Q3 FY25 lighting revenue grew by 3% YoY to Rs. 257 Cr,
demonstrating continued momentum
Standalone ECD revenue (Rs. Cr)
Standalone lighting revenue (Rs. Cr)
14%
4,407
3,876
6%
1,288
1,209
4%
743
716
3%
257
249
9M FY24
9M FY25
Q3 FY24
Q3 FY25
9M FY24
9M FY25
Q3 FY24
Q3 FY25
ECD EBIT (Rs. Cr) & EBIT Margin
Lighting EBIT (Rs. Cr) & EBIT Margin
661
15.0%
27%
521
13.5%
19%
196
15.2%
164
13.6%
A&P
Spends
EBIT
5.5
80
11.2%
23.3
76
10.2%
3.2
28
8.0
28
11.2%
10.8%
9M FY24
9M FY25
Q3 FY24
Q3 FY25
Note: Standalone Financials
9M FY24
9M FY25
Q3 FY24
Q3 FY25
6
```