
MARKITUP - Revert back from markdown to original document

Open Jeremaiha opened this issue 5 months ago • 10 comments

Any idea how to revert (convert back) a Markdown file to the original file format?

For example: DOCX -> Markdown file -> filled in by an LLM -> Markdown file -> DOCX

Jeremaiha avatar Jul 16 '25 19:07 Jeremaiha

Thanks @Jeremaiha for the question. Just to clarify, if we already have the original docx or pdf files, is the goal to convert edited markdown back into those formats after LLM changes? Or is this for workflows where only markdown is saved and the original file isn't kept? Just want to understand the use case better.

tsvlgd avatar Jul 17 '25 09:07 tsvlgd

Thank you @Savvythelegend. To clarify, the use case is that we have an existing template document. We use your library to generate the Markdown and are able to fill in the documents, but we're unable to convert the result back to the original format of the input document.

Specifically, the task should be easier when you have both the initial template document and the filled Markdown document.
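One way to use both files, as a rough sketch only: convert the untouched template to Markdown, diff it against the filled Markdown, and collect the one-to-one line changes. The helper name `paragraph_replacements` and the sample contract text are made up for illustration; writing the pairs back into the template (for example by replacing the text of the matching paragraphs with python-docx) is not shown here.

```python
import difflib


def paragraph_replacements(template_md, filled_md):
    """Pair each changed template line with its filled-in counterpart."""
    old = [ln.strip() for ln in template_md.splitlines() if ln.strip()]
    new = [ln.strip() for ln in filled_md.splitlines() if ln.strip()]
    pairs = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only one-to-one replacements map cleanly onto template paragraphs;
        # lines the LLM inserted or deleted would need separate handling.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend(zip(old[i1:i2], new[j1:j2]))
    return pairs


# Hypothetical template and filled Markdown, for illustration only.
template_md = "# Contract\n\nName: ____\n\nDate: ____\n"
filled_md = "# Contract\n\nName: Jane Doe\n\nDate: 2025-07-17\n"
for before, after in paragraph_replacements(template_md, filled_md):
    print(f"{before!r} -> {after!r}")
```

Because only the changed lines are touched, everything else in the template (styles, tables, headers) would stay as it was, which a fresh Markdown-to-DOCX conversion cannot guarantee.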

Jeremaiha avatar Jul 17 '25 17:07 Jeremaiha

👋 Hello! I'd love to work on this feature.
My plan is to build a basic Markdown-to-DOCX converter using python-docx and link it to MarkItDown workflows.
I'll start prototyping and open a draft PR soon. 🚀

yossefelnggar avatar Jul 17 '25 21:07 yossefelnggar

Yes please do, that will be highly beneficial

Jeremaiha avatar Jul 17 '25 22:07 Jeremaiha

```python
import markdown
from bs4 import BeautifulSoup
from docx import Document


def markdown_to_docx(md_text, output_file="output.docx"):
    # Render Markdown to HTML, then walk the HTML tree in document order.
    html = markdown.markdown(md_text)
    soup = BeautifulSoup(html, "html.parser")
    doc = Document()

    for el in soup.descendants:
        if el.name == "h1":
            doc.add_heading(el.get_text(), level=1)
        elif el.name == "h2":
            doc.add_heading(el.get_text(), level=2)
        elif el.name == "p" and el.find_parent("li") is None:
            # Paragraphs nested inside list items are emitted with their <li>.
            doc.add_paragraph(el.get_text())
        elif el.name == "li":
            # The built-in style is named "List Bullet" and draws the bullet itself.
            doc.add_paragraph(el.get_text().strip(), style="List Bullet")

    doc.save(output_file)
    print(f"✅ Saved: {output_file}")
```
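Note that `get_text()` flattens each element, so bold and italic spans are lost in the DOCX. A minimal sketch of one way to recover them, using only the standard library: parse the inner HTML of a paragraph into `(text, bold, italic)` runs. The names `InlineRuns` and `inline_runs` are hypothetical helpers, not part of MarkItDown or python-docx; each run could then be added with `paragraph.add_run(text)` and its `bold` / `italic` attributes set.

```python
from html.parser import HTMLParser


class InlineRuns(HTMLParser):
    """Collect (text, bold, italic) runs from an inline HTML fragment."""

    def __init__(self):
        super().__init__()
        self.runs = []
        self._bold = 0    # nesting depth of <strong>/<b>
        self._italic = 0  # nesting depth of <em>/<i>

    def handle_starttag(self, tag, attrs):
        if tag in ("strong", "b"):
            self._bold += 1
        elif tag in ("em", "i"):
            self._italic += 1

    def handle_endtag(self, tag):
        if tag in ("strong", "b"):
            self._bold = max(0, self._bold - 1)
        elif tag in ("em", "i"):
            self._italic = max(0, self._italic - 1)

    def handle_data(self, data):
        if data:
            self.runs.append((data, self._bold > 0, self._italic > 0))


def inline_runs(html_fragment):
    parser = InlineRuns()
    parser.feed(html_fragment)
    parser.close()  # flush any trailing text still buffered by the parser
    return parser.runs


print(inline_runs("Plain, <strong>bold</strong> and <em>italic</em> text."))
```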

yossefelnggar avatar Jul 18 '25 16:07 yossefelnggar

```python
from html import escape

import html2text
from docx import Document


def docx_to_markdown(input_file="input.docx"):
    # Map paragraph styles to HTML tags, then let html2text emit Markdown.
    doc = Document(input_file)
    html = ""

    for para in doc.paragraphs:
        style = para.style.name
        text = escape(para.text.strip())  # keep literal <, > and & intact
        if not text:
            continue
        if style.startswith("Heading 1"):
            html += f"<h1>{text}</h1>\n"
        elif style.startswith("Heading 2"):
            html += f"<h2>{text}</h2>\n"
        elif style.startswith("Heading 3"):
            html += f"<h3>{text}</h3>\n"
        elif style.startswith("List"):
            html += f"<li>{text}</li>\n"
        else:
            html += f"<p>{text}</p>\n"

    markdown_text = html2text.html2text(html)
    return markdown_text
```
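The style checks above stop at "Heading 3". As a small sketch, assuming the document uses Word's default English style names ("Heading 1" through "Heading 9", "List Bullet", and so on), the mapping could be pulled into one helper that covers every HTML heading level. `style_to_tag` is a hypothetical name, and unlike the `startswith` checks it matches the style name exactly.

```python
import re

# Default Word heading styles are named "Heading 1" ... "Heading 9".
_HEADING = re.compile(r"^Heading (\d)$")


def style_to_tag(style_name):
    """Return the HTML tag to use for a paragraph style name."""
    match = _HEADING.match(style_name)
    # HTML only defines h1..h6, so deeper headings fall back to <p>.
    if match and 1 <= int(match.group(1)) <= 6:
        return f"h{match.group(1)}"
    if style_name.startswith("List"):
        return "li"
    return "p"


for name in ("Heading 1", "Heading 4", "Heading 9", "List Bullet", "Normal"):
    print(name, "->", style_to_tag(name))
```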

yossefelnggar avatar Jul 18 '25 16:07 yossefelnggar

```python
# Convert from Markdown to Word
markdown_content = """
# Title

Sample text for converting Markdown to Word.

## Subheading

- First item
- Second item
"""
markdown_to_docx(markdown_content, "from_markdown_to_word.docx")

# Convert from Word to Markdown
md_result = docx_to_markdown("from_markdown_to_word.docx")
print("✅ Markdown produced from the file:")
print(md_result)
```

yossefelnggar avatar Jul 18 '25 16:07 yossefelnggar

Hi team 👋

I'm submitting this message to confirm that I've implemented the full bi-directional conversion feature: Markdown ↔️ Word (DOCX). Due to a temporary issue while creating the pull request, I've provided the full working code and explanation here in the issue for now.

✅ Markdown → Word: using python-docx
✅ Word → Markdown: using html2text + mammoth or equivalent parsing logic

I will finalize the PR once the issue is resolved.

This is the first implementation of its kind and could greatly expand MarkItDown's capabilities. Let me know if you'd like me to package it as a module.

Thanks 🙏
- @yossefelnggar

yossefelnggar avatar Jul 18 '25 16:07 yossefelnggar

Any updates on this feature? It would be useful to have for docs and, of course, other basic file types as well.

lucasrothman avatar Sep 24 '25 20:09 lucasrothman

@yossefelnggar Could you share a snippet of the code?

Jeremaiha avatar Oct 04 '25 11:10 Jeremaiha