MARKITUP - Revert back from markdown to original document
Any ideas on how to revert (convert back) a Markdown file to the original file format?
For example: DOCX -> Markdown file -> filled in by an LLM -> Markdown file -> DOCX.
Thanks @Jeremaiha for the question. Just to clarify, if we already have the original docx or pdf files, is the goal to convert edited markdown back into those formats after LLM changes? Or is this for workflows where only markdown is saved and the original file isn't kept? Just want to understand the use case better.
Thank you, @Savvythelegend. To clarify, the use case is that we have an existing template document. We use your library to generate Markdown from it, and we are able to fill in the document in that form. What we can't do is convert the filled Markdown back to the original format of the input document.
Specifically, it should be easier when you have the initial template document together with the filled Markdown document.
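To make the template-plus-filled-Markdown idea concrete, here is a minimal sketch of one possible approach: pair each paragraph of the template with its closest line in the filled Markdown, then write the filled text back into the template so the document's styles and layout are kept. The helper names (`strip_markdown`, `pair_paragraphs`, `apply_pairs`) and the fuzzy matching via `difflib` are assumptions made for illustration; none of this is part of MarkItDown.

```python
import difflib
import re

# Leading Markdown markers: "# ", "- ", "* ", "+ ", "1. "
_MD_PREFIX = re.compile(r"^\s*(#{1,6}\s+|[-*+]\s+|\d+\.\s+)")


def strip_markdown(line):
    """Drop a leading heading/list marker and bold markers from one line."""
    return _MD_PREFIX.sub("", line).replace("**", "").strip()


def pair_paragraphs(template_paragraphs, filled_markdown, cutoff=0.4):
    """Map template paragraph index -> best-matching filled line.

    `template_paragraphs` is assumed to be the paragraph texts of the
    original document, in order (empty paragraphs included, so that the
    indices line up with the document's paragraph list).
    """
    filled = [strip_markdown(l) for l in filled_markdown.splitlines() if l.strip()]
    pairs = {}
    for i, original in enumerate(template_paragraphs):
        if not original.strip():
            continue  # nothing to match against
        match = difflib.get_close_matches(original, filled, n=1, cutoff=cutoff)
        if match:
            pairs[i] = match[0]
    return pairs


def apply_pairs(doc, pairs):
    """Write the filled text into `doc.paragraphs[i].text`.

    `doc` is assumed to behave like a python-docx Document. Assigning to
    `.text` keeps the paragraph style but collapses run-level formatting.
    """
    for i, text in pairs.items():
        doc.paragraphs[i].text = text
```

With python-docx, the template texts could be collected as `[p.text for p in doc.paragraphs]` before calling `pair_paragraphs`. Fuzzy matching is a heuristic: heavily rewritten paragraphs may fall below the cutoff and be left untouched, and tables or headers would need separate handling.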
Hello! I'd love to work on this feature.
My plan is to build a basic Markdown-to-DOCX converter using python-docx and link it to MarkItDown workflows.
I'll start prototyping and open a draft PR soon.
Yes please do, that will be highly beneficial
```python
import markdown
from bs4 import BeautifulSoup
from docx import Document


def markdown_to_docx(md_text, output_file="output.docx"):
    # Render Markdown to HTML, then map a few HTML tags onto Word paragraphs.
    html = markdown.markdown(md_text)
    soup = BeautifulSoup(html, "html.parser")
    doc = Document()

    for el in soup.descendants:
        if el.name == "h1":
            doc.add_heading(el.get_text(), level=1)
        elif el.name == "h2":
            doc.add_heading(el.get_text(), level=2)
        elif el.name == "p":
            doc.add_paragraph(el.get_text())
        elif el.name == "li":
            # The "List Bullet" style draws the bullet itself.
            doc.add_paragraph(el.get_text(), style="List Bullet")

    doc.save(output_file)
    print(f"Saved: {output_file}")
```
```python
from html import escape

import html2text
from docx import Document


def docx_to_markdown(input_file="input.docx"):
    # Rebuild minimal HTML from paragraph styles, then let html2text emit Markdown.
    doc = Document(input_file)
    html_parts = []

    for para in doc.paragraphs:
        style = para.style.name
        text = escape(para.text.strip())  # escape <, > and & in the document text
        if not text:
            continue
        if style.startswith("Heading 1"):
            html_parts.append(f"<h1>{text}</h1>")
        elif style.startswith("Heading 2"):
            html_parts.append(f"<h2>{text}</h2>")
        elif style.startswith("Heading 3"):
            html_parts.append(f"<h3>{text}</h3>")
        elif style.startswith("List"):
            html_parts.append(f"<li>{text}</li>")
        else:
            html_parts.append(f"<p>{text}</p>")

    return html2text.html2text("\n".join(html_parts))
```
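As a possible alternative to going through HTML, the paragraph styles can be mapped to Markdown markers directly. The sketch below is only an illustration: `paragraphs_to_markdown` is a hypothetical helper, and it assumes its input is a list of `(style_name, text)` pairs such as `[(p.style.name, p.text) for p in Document(path).paragraphs]`, using Word's default English style names.

```python
def paragraphs_to_markdown(paragraphs):
    """Render (style_name, text) pairs as Markdown, one block per paragraph."""
    blocks = []
    for style, text in paragraphs:
        text = text.strip()
        if not text:
            continue
        if style.startswith("Heading ") and style[8:].isdigit():
            # "Heading 3" -> "### ..."; Markdown stops at six levels.
            level = min(int(style[8:]), 6)
            blocks.append("#" * level + " " + text)
        elif style.startswith("List Number"):
            blocks.append("1. " + text)
        elif style.startswith("List"):
            blocks.append("- " + text)
        else:
            blocks.append(text)
    # Blank lines between blocks keep headings and paragraphs separate.
    return "\n\n".join(blocks) + "\n"
```

This drops the html2text dependency and handles any heading level, but like the version above it ignores tables, images, and inline formatting such as bold or links.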
```python
# Convert from Markdown to Word
markdown_content = """
# Title

Sample text for converting Markdown to Word.

## Subheading

- First item
- Second item
"""
markdown_to_docx(markdown_content, "from_markdown_to_word.docx")

# Convert from Word to Markdown
md_result = docx_to_markdown("from_markdown_to_word.docx")
print("Markdown produced from the file:")
print(md_result)
```
Hi team,
I'm submitting this message to confirm that I've implemented the full bi-directional conversion feature: Markdown ↔ Word (DOCX). Due to a temporary issue while creating the pull request, I've provided the full working code and explanation here in the issue for now.
✅ Markdown → Word: using python-docx
✅ Word → Markdown: using html2text + mammoth or equivalent parsing logic
I will finalize the PR once the issue is resolved.
This is the first implementation of its kind and could greatly expand MarkItDown's capabilities. Let me know if you'd like me to package it as a module.
Thanks,
@yossefelnggar
Any updates on this feature? It would be useful to have for docs and, of course, for other basic file types as well.
Can you provide us a snippet for the code? @yossefelnggar