[Feature] Use HTML Tables Instead of Markdown Syntax for Better Table Support
Problem
The current DOCX-Table-to-Markdown conversion loses critical formatting for:
- Merged cells (rowspan/colspan)
- Complex tables (nested structures, multi-level headers)
- Styling (borders, alignment)
Markdown’s native table syntax (| --- |) lacks support for these features, resulting in broken or oversimplified output.
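To make the failure concrete, here is a minimal sketch (standard library only, not MarkItDown's actual conversion code) of a naive cell-by-cell conversion to pipe syntax. The class name NaivePipeTable and the sample table are made up for illustration. Because pipe rows have no way to carry a colspan, the header row ends up with fewer columns than the body row:

```python
from html.parser import HTMLParser


class NaivePipeTable(HTMLParser):
    """Collect cell text row by row, ignoring rowspan/colspan (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Illustrative input: the first header cell spans two columns.
HTML_TABLE = (
    "<table>"
    "<tr><th colspan='2'>Name</th><th>Age</th></tr>"
    "<tr><td>Ada</td><td>Lovelace</td><td>36</td></tr>"
    "</table>"
)


def to_pipe_rows(html):
    """Render each collected row as a Markdown pipe row."""
    parser = NaivePipeTable()
    parser.feed(html)
    return ["| " + " | ".join(row) + " |" for row in parser.rows]


pipe_rows = to_pipe_rows(HTML_TABLE)
for row in pipe_rows:
    print(row)
```

The header row comes out with two cells and the body row with three, so "Age" no longer lines up with its column.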
Solution
Implemented a non-invasive override to output tables as HTML instead of Markdown, preserving structure and merged cells. Key changes:
- CustomMarkdownify class (extends _CustomMarkdownify):
  Overrides convert_table(), convert_td(), convert_tr(), and convert_th() to return raw HTML elements.
  Wraps tables in `<html><body>` to ensure valid HTML5 output.
- CustomHtmlConverter & CustomDocxConverter:
  Propagate the modified table handling while maintaining other conversions (e.g., text, headings).
- CustomMarkitdown class:
  Swaps the default DocxConverter with CustomDocxConverter at runtime.
HTML table result example:
Code:
import logging
from typing import Any, BinaryIO

from bs4 import BeautifulSoup
from markitdown import DocumentConverterResult, MarkItDown, StreamInfo
from markitdown._markitdown import PRIORITY_SPECIFIC_FILE_FORMAT, ConverterRegistration
from markitdown.converters import DocxConverter, HtmlConverter
from markitdown.converters._markdownify import _CustomMarkdownify

# The original used a project-local logger (common.log); the standard
# logging module is substituted here so the snippet runs on its own.
logger = logging.getLogger(__name__)


class CustomMarkdownify(_CustomMarkdownify):
    """Emit tables as raw HTML instead of Markdown pipe tables."""

    def convert_table(self, el, text, parent_tags):
        # Unwrap heading tags (h1-h6) inside the table, keeping their text.
        for level in range(1, 7):
            for heading in el.find_all(f"h{level}"):
                heading.unwrap()
        return f"<html><body>{el}</body></html>"

    # Rows and cells are returned unchanged so rowspan/colspan survive.
    def convert_td(self, el, text, parent_tags):
        return str(el)

    def convert_tr(self, el, text, parent_tags):
        return str(el)

    def convert_th(self, el, text, parent_tags):
        return str(el)


class CustomHtmlConverter(HtmlConverter):
    def convert(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,  # Options to pass to the converter
    ) -> DocumentConverterResult:
        # Parse the stream
        encoding = "utf-8" if stream_info.charset is None else stream_info.charset
        soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)

        # Remove javascript and style blocks
        for script in soup(["script", "style"]):
            script.extract()

        # Convert only the main content; fall back to the whole document
        body_elm = soup.find("body")
        if body_elm:
            webpage_text = CustomMarkdownify(**kwargs).convert_soup(body_elm)
        else:
            webpage_text = CustomMarkdownify(**kwargs).convert_soup(soup)
        assert isinstance(webpage_text, str)

        # Remove leading and trailing newlines
        webpage_text = webpage_text.strip()

        return DocumentConverterResult(
            markdown=webpage_text,
            title=None if soup.title is None else soup.title.string,
        )


class CustomDocxConverter(DocxConverter):
    def __init__(self):
        super().__init__()
        # Route the intermediate HTML through the table-preserving converter
        self._html_converter = CustomHtmlConverter()


class CustomMarkitdown(MarkItDown):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.replace_converter()

    def replace_converter(self):
        # Swap the first registered DocxConverter for the custom one
        for ix, registration in enumerate(self._converters):
            if isinstance(registration.converter, DocxConverter):
                self._converters[ix] = ConverterRegistration(
                    converter=CustomDocxConverter(),
                    priority=PRIORITY_SPECIFIC_FILE_FORMAT,
                )
                logger.info("Replaced MarkItDown DOCX converter with %s", CustomDocxConverter)
                break


if __name__ == "__main__":
    markdown = CustomMarkitdown()
    md = markdown.convert("test.docx")
    with open("result.md", "w", encoding="utf-8") as f:
        f.write(md.markdown)
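One way to sanity-check the output is to scan the generated Markdown for embedded tables and count cells that still carry a rowspan or colspan greater than 1. The sketch below uses only the standard library; the names SpanCounter and count_merged_cells and the sample string are made up for illustration and are not part of MarkItDown:

```python
from html.parser import HTMLParser


def _span(attrs, name):
    """Return the integer value of a span attribute, defaulting to 1."""
    value = dict(attrs).get(name)
    return int(value) if value and value.isdigit() else 1


class SpanCounter(HTMLParser):
    """Count <table> elements and merged cells in a text (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tables = 0
        self.merged_cells = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1
        elif tag in ("td", "th"):
            if _span(attrs, "rowspan") > 1 or _span(attrs, "colspan") > 1:
                self.merged_cells += 1


def count_merged_cells(markdown_text):
    """Return (number of tables, number of merged cells) found in the text."""
    counter = SpanCounter()
    counter.feed(markdown_text)
    return counter.tables, counter.merged_cells


# Illustrative stand-in for the contents of result.md
sample = (
    "# Title\n\n"
    '<html><body><table><tr><td rowspan="2">A</td><td>B</td></tr>'
    "<tr><td>C</td></tr></table></body></html>\n"
)
print(count_merged_cells(sample))
```

Running the same function on the text written to result.md gives a quick signal of whether merged cells made it through the conversion.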
Benefits
- ✅ Preserves merged cells and complex table structure that pipe tables drop.
- ✅ No upstream breaks (override-based, doesn’t modify core logic).
- ✅ Works with renderers supporting HTML (GitHub, Typora, etc.).
Request
Consider merging this as an opt-in feature (e.g., via table_format="html" flag) or as the default behavior for complex tables.
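As a rough sketch of how such a flag could behave, the helper below resolves the output format per table. Only the table_format name and its "html" value come from the request above; the "markdown" default, the "auto" value (HTML only when a table has merged cells), and the function names are assumptions for illustration, not an existing MarkItDown API:

```python
import re

# Matches rowspan/colspan attributes and captures their numeric value.
_SPAN_ATTR = re.compile(r"""\b(?:rowspan|colspan)\s*=\s*["']?(\d+)""", re.IGNORECASE)

_ALLOWED_FORMATS = ("markdown", "html", "auto")  # "auto" is a hypothetical option


def has_merged_cells(table_html: str) -> bool:
    """True if any cell spans more than one row or column."""
    return any(int(n) > 1 for n in _SPAN_ATTR.findall(table_html))


def resolve_table_format(table_html: str, table_format: str = "markdown") -> str:
    """Pick the output format for one table based on the opt-in flag."""
    if table_format not in _ALLOWED_FORMATS:
        raise ValueError(f"unsupported table_format: {table_format!r}")
    if table_format == "auto":
        return "html" if has_merged_cells(table_html) else "markdown"
    return table_format


print(resolve_table_format("<td colspan='2'>x</td>", "auto"))
print(resolve_table_format("<td>x</td>", "auto"))
```

Keeping "markdown" as the default would leave existing output unchanged, while "auto" would cover the "default for complex tables" variant.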
Why This Matters
- Many users need DOCX tables to render correctly in Markdown viewers.
- HTML tables are the only reliable way to express merged cells in Markdown.
This feels really necessary.
Markdown tables will lose some table information.