How to ignore images when converting?

Open Pata-Mon opened this issue 8 months ago • 5 comments

markitdown converts images in a document into the following form: ![title](url). Is there any way to make markitdown ignore these images and generate no content for them?

Pata-Mon avatar Mar 10 '25 02:03 Pata-Mon

I use this simple tool to ignore images:

import io
import re

from markitdown import MarkItDown


class Html2MarkdownConverter:
    def __init__(self):
        self.converter = MarkItDown()

    def convert(self, html_str: str) -> str:
        # feed the HTML string to markitdown as an in-memory byte stream
        return self.converter.convert_stream(
            io.BytesIO(html_str.encode("utf8")), file_extension=".html"
        ).text_content

    def convert_without_images(self, html_str: str) -> str:
        markdown_text = self.convert(html_str)
        # remove images (pattern: ![alt text](URL))
        no_images = re.sub(r"!\[.*?]\(.*?\)", "", markdown_text)

        # replace links with their text (pattern: [text](URL))
        no_links = re.sub(r"\[(.*?)]\(.*?\)", r"\1", no_images)

        # collapse runs of three or more newlines into a single blank line
        cleaned_text = re.sub(r"\n{3,}", "\n\n", no_links)

        return cleaned_text
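
To see what the three substitutions do without installing markitdown, here is a minimal sketch that applies the same patterns to a plain Markdown string. The function name `strip_images_and_links` and the sample text are made up for illustration; only the regular expressions come from the class above.

```python
import re

# Same three patterns as in convert_without_images, precompiled.
IMAGE_PATTERN = re.compile(r"!\[.*?]\(.*?\)")
LINK_PATTERN = re.compile(r"\[(.*?)]\(.*?\)")
BLANK_RUN_PATTERN = re.compile(r"\n{3,}")


def strip_images_and_links(markdown_text: str) -> str:
    """Drop images, keep only the text of links, collapse extra blank lines."""
    no_images = IMAGE_PATTERN.sub("", markdown_text)
    no_links = LINK_PATTERN.sub(r"\1", no_images)
    return BLANK_RUN_PATTERN.sub("\n\n", no_links)


# Hypothetical input: a heading, an image, and a sentence with a link.
sample = (
    "# Title\n\n"
    "![logo](https://example.com/logo.png)\n\n\n"
    "See [the docs](https://example.com/docs).\n"
)
cleaned = strip_images_and_links(sample)
print(cleaned)
```

The image must be removed before the link pattern runs, because `[text](URL)` also matches the tail of `![alt](URL)` and would otherwise leave a stray `!alt` behind. These patterns are a simple heuristic: alt text containing `]` or URLs containing `)` will not be matched cleanly.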

PhiFever avatar Mar 11 '25 02:03 PhiFever

Thanks! The "remove link" is useful for me too.

So it seems that markitdown does not currently support ignoring images and links? I hope it will be supported soon.

Pata-Mon avatar Mar 11 '25 03:03 Pata-Mon

Thanks. Some folks are advocating for embedding images via data URIs. Others are advocating for removing images entirely. A third option is to treat them as sub-documents and convert them into text recursively (this would potentially give you image captions and metadata). It's clear that some additional control or configuration is needed.

I want to think some about how to proceed here -- it will likely involve some decisions about how to deal with sub-documents.

In the meantime, I can probably add some conversion options to the Python interface to support this -- but they might end up deprecated once there's a better design around sub-documents. What do you think?

afourney avatar Mar 11 '25 05:03 afourney

I suggest this to be optional, like:

markitdown path-to-file.pdf -o document.md --ignore-images
md = MarkItDown(ignore_images=True)

or

markitdown path-to-file.pdf -o document.md --save-images path_to_save
md = MarkItDown(save_images=path_to_save)
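
Until something like this exists, an opt-in switch can be approximated outside the library. The sketch below wraps any function that returns Markdown and strips images only when asked. Note that `ignore_images` here is a parameter of the hypothetical wrapper `with_image_option`, not an existing markitdown option, and `fake_convert` is a stand-in for a real converter call.

```python
import re
from typing import Callable

# Matches Markdown images of the form ![alt](url).
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def with_image_option(
    convert: Callable[[str], str], ignore_images: bool = False
) -> Callable[[str], str]:
    """Wrap a source-to-Markdown function with an optional image filter."""

    def wrapped(source: str) -> str:
        markdown_text = convert(source)
        if ignore_images:
            markdown_text = IMAGE_PATTERN.sub("", markdown_text)
        return markdown_text

    return wrapped


# Stand-in converter for illustration; a real one would call MarkItDown.
def fake_convert(source: str) -> str:
    return f"Intro ![figure]({source}) outro"


keep = with_image_option(fake_convert)
drop = with_image_option(fake_convert, ignore_images=True)
print(keep("a.png"))
print(drop("a.png"))
```

Keeping the default at `False` means existing output is unchanged unless the caller opts in, which matches the "optional" behavior suggested above.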

Pata-Mon avatar Mar 11 '25 05:03 Pata-Mon

Yes, for sure it will be optional/configurable.

afourney avatar Mar 11 '25 13:03 afourney