feat: Add 'convert_local_content' method to directly convert file content (str)

Open Athroniaeth opened this issue 11 months ago • 1 comments

Hello, maybe I missed it, but I couldn't find a way to directly convert the content of a file (str) into Markdown. This PR adds such a method and contains its unit tests.

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert_local_content("<h1>Hello World!</h1>", file_extension=".html")
print(result.text_content)

-->

# Hello World!

Athroniaeth, Jan 22 '25 21:01

@microsoft-github-policy-service agree

Athroniaeth, Jan 22 '25 21:01

MarkItDown deals with byte streams. You can get the same behavior by doing:

import io
from markitdown import MarkItDown

markitdown = MarkItDown()
input_data = b"<html><body><h1>Test</h1></body></html>"
result = markitdown.convert_stream(io.BytesIO(input_data), file_extension=".html")

If it's a string, then perhaps:

import io
input_data = "<html><body><h1>Test</h1></body></html>".encode("utf-8")
result = markitdown.convert_stream(io.BytesIO(input_data), file_extension=".html")

If this is a common enough pattern, I could imagine creating a convenience method, perhaps convert_string. However, I would prefer the more explicit approach above to adding a new entry point to maintain. Please let me know what you think.
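For illustration only, here is a rough sketch of what such a convenience wrapper might look like in user code. The name convert_string, its parameters, and the idea of passing the converter in as an argument are all assumptions made for this example; nothing like this exists in the library. It simply wraps the encode-then-convert_stream pattern shown above.

```python
import io


def convert_string(converter, text, file_extension, encoding="utf-8"):
    """Hypothetical helper: convert an in-memory string via a byte stream.

    `converter` is assumed to be a MarkItDown instance (or any object with a
    compatible `convert_stream` method). This is not part of the library.
    """
    # MarkItDown deals with byte streams, so encode the text first.
    stream = io.BytesIO(text.encode(encoding))
    # Same call as in the snippets above; the extension hints at the format.
    return converter.convert_stream(stream, file_extension=file_extension)


# Usage, assuming markitdown is installed:
#   from markitdown import MarkItDown
#   result = convert_string(MarkItDown(), "<h1>Hello World!</h1>", ".html")
#   print(result.text_content)
```

Keeping a helper like this on the caller's side avoids adding a new entry point to the library while still removing the repeated encoding boilerplate.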

afourney, Mar 06 '25 07:03

I think this pattern is common in spiders (web crawlers): they grab the article content as HTML and convert it to Markdown.

PhiFever, Mar 10 '25 07:03