markitdown: How to ignore images when converting?

For images in a document, markitdown converts them into the following form: ![title](url). Is there any way to make markitdown ignore these images and not generate any content for them?

Mar 10 '25 02:03 Pata-Mon

I use this simple tool to ignore images:

import io
import re

from markitdown import MarkItDown


class Html2MarkdownConverter:
    def __init__(self):
        self.converter = MarkItDown()

    def convert(self, html_str: str) -> str:
        # convert the HTML string through an in-memory byte stream
        return self.converter.convert_stream(
            io.BytesIO(html_str.encode("utf8")), file_extension=".html"
        ).text_content

    def convert_without_images(self, html_str: str) -> str:
        markdown_text = self.convert(html_str)
        # remove images (pattern: ![alt text](URL))
        no_images = re.sub(r"!\[.*?\]\(.*?\)", "", markdown_text)

        # replace links with their text (pattern: [text](URL))
        no_links = re.sub(r"\[(.*?)\]\(.*?\)", r"\1", no_images)

        # collapse runs of three or more newlines into a single blank line
        cleaned_text = re.sub(r"\n{3,}", "\n\n", no_links)

        return cleaned_text
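
For anyone who wants to try the post-processing step without installing markitdown, here is a minimal sketch that applies the same three regular expressions to a plain Markdown string. The function names (strip_images, unwrap_links, collapse_blank_lines) are made up for this example and are not part of markitdown.

```python
import re

# Hypothetical helper names; the patterns mirror the ones in the class above.
IMAGE_PATTERN = re.compile(r"!\[.*?\]\(.*?\)")
LINK_PATTERN = re.compile(r"\[(.*?)\]\(.*?\)")


def strip_images(markdown_text: str) -> str:
    """Remove inline images of the form ![alt](url)."""
    return IMAGE_PATTERN.sub("", markdown_text)


def unwrap_links(markdown_text: str) -> str:
    """Replace [text](url) with just the link text."""
    return LINK_PATTERN.sub(r"\1", markdown_text)


def collapse_blank_lines(markdown_text: str) -> str:
    """Collapse runs of three or more newlines into one blank line."""
    return re.sub(r"\n{3,}", "\n\n", markdown_text)


sample = (
    "# Title\n\n"
    "![logo](https://example.com/logo.png)\n\n\n"
    "See [the docs](https://example.com/docs).\n"
)
cleaned = collapse_blank_lines(unwrap_links(strip_images(sample)))
print(cleaned)
```

Images have to be removed before links, because the link pattern would otherwise match the bracketed part of an image and leave a stray "!". The patterns are deliberately simple: a URL that itself contains a closing parenthesis, or alt text that contains a closing bracket, will not be matched cleanly.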

Mar 11 '25 02:03 PhiFever

Thanks! The "remove link" is useful for me too.

So it seems that markitdown does not currently support ignoring images and links? I hope it will be supported soon.

Mar 11 '25 03:03 Pata-Mon

Thanks. Some folks are advocating for embedding images via data URIs. Others are advocating for removing images entirely. A third option is to consider them sub-documents, and convert them into text recursively (this would potentially give you image captions and metadata). It's clear that some additional control or configuration is needed.

I want to think some about how to proceed here -- it will likely involve some decisions about how to deal with sub-documents.

In the meantime, I can probably add some conversion options to the Python interface to support this -- but it might end up deprecated once there's a better design around sub-documents. What do you think?

Mar 11 '25 05:03 afourney

I suggest making this optional, for example:

markitdown path-to-file.pdf -o document.md --ignore-images
md = MarkItDown(ignore_images=True)

or

markitdown path-to-file.pdf -o document.md --save-images path_to_save
md = MarkItDown(save_images=path_to_save)
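
Until something like that exists, the choice between ignoring images and keeping track of them can be approximated on the caller side. The sketch below is only an illustration: split_images and its ignore_images parameter are hypothetical names borrowed from the proposal above, not markitdown options, and the function only collects image references from already converted Markdown rather than saving any files.

```python
import re
from typing import List, Tuple

# Captures the alt text and the URL of inline images such as ![alt](url).
IMAGE_REF = re.compile(r"!\[(.*?)\]\((.*?)\)")


def split_images(
    markdown_text: str, ignore_images: bool = True
) -> Tuple[str, List[Tuple[str, str]]]:
    """Return (text, image_refs).

    image_refs lists the (alt, url) pairs found in the text. When
    ignore_images is True the image syntax is removed from the returned
    text; otherwise the text is returned unchanged.
    """
    refs = IMAGE_REF.findall(markdown_text)
    if ignore_images:
        markdown_text = IMAGE_REF.sub("", markdown_text)
    return markdown_text, refs


text, refs = split_images("Intro ![chart](img/chart.png) outro")
print(text)
print(refs)
```

The returned list gives the caller what it would need to download or copy the images separately, which is roughly what a save-images option would do.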

Mar 11 '25 05:03 Pata-Mon

Yes, for sure it will be optional/configurable.

Mar 11 '25 13:03 afourney
