markitdown [Feature] Use HTML Tables Instead of Markdown Syntax for Better Table Support

Problem

The current DOCX-Table-to-Markdown conversion loses critical formatting for:

Merged cells (rowspan/colspan)
Complex tables (nested structures, multi-level headers)
Styling (borders, alignment)

Markdown’s native table syntax (| --- |) lacks support for these features, resulting in broken or oversimplified output.
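To make the loss concrete, here is a minimal sketch using only the Python standard library. It is not markitdown's actual conversion path: `CellCollector`, `to_pipe_table`, and `SAMPLE` are hypothetical names introduced for illustration. It flattens a table with a `colspan="2"` header into pipe syntax, and the span has nowhere to go.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of each <td>/<th>, grouped by <tr>. Attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def to_pipe_table(html: str) -> str:
    """Naively flatten an HTML table into Markdown pipe syntax (illustration only)."""
    collector = CellCollector()
    collector.feed(html)
    header, *body = collector.rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Hypothetical input: one header cell spanning two columns.
SAMPLE = (
    "<table>"
    '<tr><th colspan="2">Name</th></tr>'
    "<tr><td>First</td><td>Last</td></tr>"
    "</table>"
)

pipe_table = to_pipe_table(SAMPLE)
print(pipe_table)
```

The header row comes out with one cell and the body row with two, so the pipe table is ragged and the `colspan` information is gone. Keeping the original `<table>` markup avoids this.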

Solution

Implemented a non-invasive override to output tables as HTML instead of Markdown, preserving structure and merged cells. Key changes:

CustomMarkdownify Class (extends _CustomMarkdownify):

Overrides convert_table(), convert_td(), convert_tr(), and convert_th() to return raw HTML elements.

Wraps tables in <html><body>...</body></html> to ensure valid HTML5 output.
CustomHtmlConverter & CustomDocxConverter:

Propagate the modified table handling while maintaining other conversions (e.g., text, headings).
CustomMarkitdown Class:

Swaps the default DocxConverter with CustomDocxConverter at runtime.

HTML table result example:

Code:

from typing import BinaryIO, Any

from bs4 import BeautifulSoup
from markitdown._markitdown import ConverterRegistration, PRIORITY_SPECIFIC_FILE_FORMAT
from markitdown.converters import DocxConverter, HtmlConverter
from markitdown.converters._markdownify import _CustomMarkdownify
from markitdown import MarkItDown, DocumentConverterResult, StreamInfo

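import logging

# Standard-library logger, used in place of a project-specific logging module
logger = logging.getLogger(__name__)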


class CustomMarkdownify(_CustomMarkdownify):
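    def convert_table(self, el, text, parent_tags):
        # Unwrap heading tags (h1-h6) inside the table so only their contents remain
        headers = [f"h{i}" for i in range(1, 7)]
        for h in headers:
            for h_element in el.find_all(h):
                h_element.unwrap()
        # Emit the table as raw HTML instead of Markdown pipe syntax
        return f"<html><body>{el}</body></html>"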

    def convert_td(self, el, text, parent_tags):
        return str(el)

    def convert_tr(self, el, text, parent_tags):
        return str(el)

    def convert_th(self, el, text, parent_tags):
        return str(el)


class CustomHtmlConverter(HtmlConverter):
    def convert(
            self,
            file_stream: BinaryIO,
            stream_info: StreamInfo,
            **kwargs: Any,  # Options to pass to the converter
    ) -> DocumentConverterResult:
        # Parse the stream
        encoding = "utf-8" if stream_info.charset is None else stream_info.charset
        soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)

        # Remove javascript and style blocks
        for script in soup(["script", "style"]):
            script.extract()

        # Print only the main content
        body_elm = soup.find("body")
        webpage_text = ""
        if body_elm:
            webpage_text = CustomMarkdownify(**kwargs).convert_soup(body_elm)
        else:
            webpage_text = CustomMarkdownify(**kwargs).convert_soup(soup)

        assert isinstance(webpage_text, str)

        # remove leading and trailing \n
        webpage_text = webpage_text.strip()

        return DocumentConverterResult(
            markdown=webpage_text,
            title=None if soup.title is None else soup.title.string,
        )


class CustomDocxConverter(DocxConverter):
    def __init__(self):
        super().__init__()
        self._html_converter = CustomHtmlConverter()


class CustomMarkitdown(MarkItDown):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.replace_converter()

    def replace_converter(self):
        for ix, convert in enumerate(self._converters):
            if isinstance(convert.converter, DocxConverter):
                self._converters[ix] = ConverterRegistration(converter=CustomDocxConverter(),
                                                             priority=PRIORITY_SPECIFIC_FILE_FORMAT)
                logger.info(f"replaced markitdown DOCX converter with custom converter: {CustomDocxConverter}")
                break


if __name__ == '__main__':
    markdown = CustomMarkitdown()
    md = markdown.convert('test.docx')
    with open("result.md", "w", encoding="utf-8") as f:
        f.write(md.markdown)

Benefits

✅ Preserves merged cells (rowspan/colspan) and nested structure in complex tables.
✅ No upstream breaks (override-based, doesn’t modify core logic).
✅ Works with renderers supporting HTML (GitHub, Typora, etc.).

Request

Consider merging this as an opt-in feature (e.g., via table_format="html" flag) or as the default behavior for complex tables.
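As a rough sketch of how the "default for complex tables" variant might decide, the snippet below checks whether a table contains any `rowspan` or `colspan` greater than 1 and picks a format accordingly. Everything here is an assumption for discussion: the `table_format` name comes from the proposal above, while the `"auto"` value and the helpers `has_merged_cells` and `choose_table_format` are hypothetical and not part of markitdown.

```python
from html.parser import HTMLParser


class SpanDetector(HTMLParser):
    """Flag any <td>/<th> that carries a rowspan or colspan other than 1."""

    def __init__(self):
        super().__init__()
        self.has_span = False

    def handle_starttag(self, tag, attrs):
        if tag not in ("td", "th"):
            return
        for name, value in attrs:
            # A valueless attribute arrives as None; treat it as the default span of 1.
            if name in ("rowspan", "colspan") and (value or "1").strip() != "1":
                self.has_span = True


def has_merged_cells(table_html: str) -> bool:
    detector = SpanDetector()
    detector.feed(table_html)
    return detector.has_span


def choose_table_format(table_html: str, table_format: str = "auto") -> str:
    """Hypothetical policy: honor an explicit choice, else use HTML only for merged cells."""
    if table_format in ("html", "markdown"):
        return table_format
    return "html" if has_merged_cells(table_html) else "markdown"


simple = "<table><tr><td>a</td><td>b</td></tr></table>"
merged = '<table><tr><td rowspan="2">a</td><td>b</td></tr><tr><td>c</td></tr></table>'

print(choose_table_format(simple))
print(choose_table_format(merged))
```

With a policy like this, simple tables would keep the familiar pipe syntax and only tables that Markdown cannot express would fall back to HTML.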

Why This Matters

Many users need DOCX tables to render correctly in Markdown viewers.
HTML tables are the only reliable way to express merged cells in Markdown.

Apr 25 '25 07:04 tosmart01

This feels really necessary.

Apr 27 '25 09:04 flyrun9527

Markdown will lose some table information.

Jul 22 '25 07:07 Basil1991