markitdown
How to ignore images when converting?
For images in the document, markitdown will convert them into the following form:

Is there any way to make markitdown ignore these images and not generate any content?
I use this simple tool to ignore images
import io
import re

from markitdown import MarkItDown


class Html2MarkdownConverter:
    def __init__(self):
        self.converter = MarkItDown()

    def convert(self, html_str: str) -> str:
        return self.converter.convert_stream(
            io.BytesIO(html_str.encode("utf8")), file_extension=".html"
        ).text_content

    def convert_without_images(self, html_str: str) -> str:
        markdown_text = self.convert(html_str)
        # remove images (pattern: ![alt](URL))
        no_images = re.sub(r"!\[.*?]\(.*?\)", "", markdown_text)
        # replace links with their text (pattern: [text](URL))
        no_links = re.sub(r"\[(.*?)]\(.*?\)", r"\1", no_images)
        # collapse runs of three or more newlines into a single blank line
        cleaned_text = re.sub(r"\n{3,}", "\n\n", no_links)
        return cleaned_text
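For anyone who wants to try the post-processing step on its own, here is a minimal sketch that applies the same three regular expressions to a Markdown string without calling markitdown at all. The function name `strip_images_and_links` and the sample text are made up for illustration.

```python
import re


def strip_images_and_links(markdown_text: str) -> str:
    """Apply the same three substitutions as the workaround above."""
    # Drop image syntax entirely: ![alt](URL)
    no_images = re.sub(r"!\[.*?]\(.*?\)", "", markdown_text)
    # Keep only the visible text of links: [text](URL) -> text
    no_links = re.sub(r"\[(.*?)]\(.*?\)", r"\1", no_images)
    # Collapse the blank lines left behind by removed images
    return re.sub(r"\n{3,}", "\n\n", no_links)


# Hypothetical sample input, not real markitdown output
sample = (
    "Intro\n\n"
    "![logo](data:image/png;base64,AAAA)\n\n"
    "See [the docs](https://example.com/docs).\n"
)

print(strip_images_and_links(sample))
```

One caveat: the pattern stops at the first closing parenthesis, so an image URL that itself contains `)` is only partly removed. If your documents have such URLs, a stricter pattern or a real Markdown parser is probably a safer choice.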
Thanks! The "remove link" is useful for me too.
So it seems that markitdown does not currently support ignoring images and links? I hope it will be supported soon.
Thanks. Some folks are advocating for embedding images via data-uris. Others are advocating for removing images entirely. A third option is to consider them sub-documents, and convert them into text recursively (this would potentially give you image captions and metadata). It's clear that some additional control or configuration is needed.
I want to think some about how to proceed here -- it will likely involve some decisions about how to deal with sub-documents.
In the meantime, I can probably add some conversion options to the Python interface to support this -- but they might end up deprecated once there's a better design around sub-documents. What do you think?
I suggest making this optional, for example:
markitdown path-to-file.pdf -o document.md --ignore-images
md = MarkItDown(ignore_images=True)
or
markitdown path-to-file.pdf -o document.md --save-images path_to_save
md = MarkItDown(save_images=path_to_save)
Yes, for sure it will be optional/configurable.
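Until an option like this exists, the proposed `ignore_images` behaviour could be emulated with a thin wrapper around any conversion function. This is only a sketch: `ImageAwareConverter` and `fake_convert` are hypothetical names and are not part of markitdown. The wrapper takes the converter as a plain callable, so it does not assume anything about markitdown's API.

```python
import re
from typing import Callable

# Matches Markdown image syntax: ![alt](URL)
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\([^)]*\)")


class ImageAwareConverter:
    """Hypothetical wrapper that mimics an `ignore_images` switch."""

    def __init__(self, convert: Callable[[str], str], ignore_images: bool = False):
        self._convert = convert
        self.ignore_images = ignore_images

    def convert(self, source: str) -> str:
        text = self._convert(source)
        if self.ignore_images:
            # Remove images, then tidy up the blank lines they leave behind
            text = IMAGE_PATTERN.sub("", text)
            text = re.sub(r"\n{3,}", "\n\n", text)
        return text


def fake_convert(source: str) -> str:
    """Stand-in for a real converter; returns fixed Markdown for the demo."""
    return "Title\n\n![chart](chart.png)\n\nBody text.\n"


with_images = ImageAwareConverter(fake_convert)
without_images = ImageAwareConverter(fake_convert, ignore_images=True)

print(without_images.convert("ignored"))
```

In real use, the callable passed in would be something that runs markitdown and returns its `text_content`, such as the `convert` method of the `Html2MarkdownConverter` class posted earlier in this thread.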