
[Feature] Use HTML Tables Instead of Markdown Syntax for Better Table Support

Open tosmart01 opened this issue 8 months ago • 1 comments

Problem

The current DOCX-Table-to-Markdown conversion loses critical formatting for:

  • Merged cells (rowspan/colspan)

  • Complex tables (nested structures, multi-level headers)

  • Styling (borders, alignment)

Markdown’s native table syntax (| --- |) lacks support for these features, resulting in broken or oversimplified output.
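
To make the loss concrete, here is a minimal sketch using only the Python standard library. `CellCollector` and `to_pipe_table` are hypothetical helpers written for this illustration, not markitdown code. They mimic a naive pipe-table conversion that reads cell text and ignores span attributes.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect cell text row by row, ignoring rowspan/colspan attributes."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def to_pipe_table(html):
    """Render the collected rows with Markdown pipe syntax (hypothetical helper)."""
    parser = CellCollector()
    parser.feed(html)
    lines = ["| " + " | ".join(row) + " |" for row in parser.rows]
    # The separator row is sized from the header row only.
    lines.insert(1, "|" + " --- |" * len(parser.rows[0]))
    return "\n".join(lines)


HTML_TABLE = (
    "<table>"
    '<tr><th colspan="2">Name</th><th>Age</th></tr>'
    "<tr><td>Ada</td><td>Lovelace</td><td>36</td></tr>"
    "</table>"
)

print(to_pipe_table(HTML_TABLE))
# | Name | Age |
# | --- | --- |
# | Ada | Lovelace | 36 |
```

The header row ends up with two cells and the body row with three, so "Age" no longer lines up with its column. The colspan information has nowhere to go in pipe syntax.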

Solution

Implemented a non-invasive override to output tables as HTML instead of Markdown, preserving structure and merged cells. Key changes:

  1. CustomMarkdownify Class (extends _CustomMarkdownify):

    Overrides convert_table(), convert_td(), convert_tr(), and convert_th() to return raw HTML elements.

    Wraps each table in <html><body> ... </body></html> to ensure valid HTML5 output.

  2. CustomHtmlConverter & CustomDocxConverter:

    Propagate the modified table handling while maintaining other conversions (e.g., text, headings).

  3. CustomMarkitdown Class:

    Swaps the default DocxConverter with CustomDocxConverter at runtime.

HTML table result example:

[Image: rendered HTML table output]

Code:

from typing import BinaryIO, Any

from bs4 import BeautifulSoup
from markitdown._markitdown import ConverterRegistration, PRIORITY_SPECIFIC_FILE_FORMAT
from markitdown.converters import DocxConverter, HtmlConverter
from markitdown.converters._markdownify import _CustomMarkdownify
from markitdown import MarkItDown, DocumentConverterResult, StreamInfo

import logging

# Stand-in for the author's project-local `from common.log import logger`.
logger = logging.getLogger(__name__)


class CustomMarkdownify(_CustomMarkdownify):
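    def convert_table(self, el, text, parent_tags):
        # Strip heading tags (h1-h6) nested inside the table, keeping their contents.
        headers = [f"h{i}" for i in range(1, 7)]
        for h in headers:
            for h_element in el.find_all(h):
                h_element.unwrap()
        # Emit the original table element as raw HTML instead of Markdown.
        return f"<html><body>{el}</body></html>"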

    def convert_td(self, el, text, parent_tags):
        return str(el)

    def convert_tr(self, el, text, parent_tags):
        return str(el)

    def convert_th(self, el, text, parent_tags):
        return str(el)


class CustomHtmlConverter(HtmlConverter):
    def convert(
            self,
            file_stream: BinaryIO,
            stream_info: StreamInfo,
            **kwargs: Any,  # Options to pass to the converter
    ) -> DocumentConverterResult:
        # Parse the stream
        encoding = "utf-8" if stream_info.charset is None else stream_info.charset
        soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)

        # Remove javascript and style blocks
        for script in soup(["script", "style"]):
            script.extract()

        # Print only the main content
        body_elm = soup.find("body")
        webpage_text = ""
        if body_elm:
            webpage_text = CustomMarkdownify(**kwargs).convert_soup(body_elm)
        else:
            webpage_text = CustomMarkdownify(**kwargs).convert_soup(soup)

        assert isinstance(webpage_text, str)

        # remove leading and trailing \n
        webpage_text = webpage_text.strip()

        return DocumentConverterResult(
            markdown=webpage_text,
            title=None if soup.title is None else soup.title.string,
        )


class CustomDocxConverter(DocxConverter):
    def __init__(self):
        super().__init__()
        self._html_converter = CustomHtmlConverter()


class CustomMarkitdown(MarkItDown):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.replace_converter()

    def replace_converter(self):
        for ix, convert in enumerate(self._converters):
            if isinstance(convert.converter, DocxConverter):
                self._converters[ix] = ConverterRegistration(converter=CustomDocxConverter(),
                                                             priority=PRIORITY_SPECIFIC_FILE_FORMAT)
                logger.info(f"Replaced markitdown DOCX converter with custom converter: {CustomDocxConverter}")
                break


if __name__ == '__main__':
    markdown = CustomMarkitdown()
    md = markdown.convert('test.docx')
    with open("result.md", "w", encoding="utf-8") as f:
        f.write(md.markdown)
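
One way to check that merged cells survived the conversion is to scan the generated text for tables and span attributes. The sketch below uses only the standard library. `SpanCounter` and `summarize_tables` are hypothetical names introduced here, and the sample string merely imitates the shape of the output produced by the code above.

```python
from html.parser import HTMLParser


class SpanCounter(HTMLParser):
    """Count <table> elements and cells that carry a rowspan/colspan other than 1."""

    def __init__(self):
        super().__init__()
        self.tables = 0
        self.merged_cells = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1
        elif tag in ("td", "th"):
            attr_map = dict(attrs)
            spans = (attr_map.get("rowspan", "1"), attr_map.get("colspan", "1"))
            if any(span not in (None, "", "1") for span in spans):
                self.merged_cells += 1


def summarize_tables(markdown_text):
    """Return (table_count, merged_cell_count) for Markdown with embedded HTML."""
    counter = SpanCounter()
    counter.feed(markdown_text)
    return counter.tables, counter.merged_cells


# Illustrative sample, not real converter output.
SAMPLE = (
    "# Title\n\n"
    '<html><body><table><tr><td colspan="2">A</td></tr>'
    "<tr><td>B</td><td>C</td></tr></table></body></html>\n"
)

print(summarize_tables(SAMPLE))  # (1, 1)
```

Running `summarize_tables` on the contents of result.md and comparing the counts with the source document gives a quick sanity check.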

Benefits

  • Preserves merged cells (rowspan/colspan) and complex table structure in the output.
  • No upstream breaks (override-based, doesn’t modify core logic).
  • Works with renderers supporting HTML (GitHub, Typora, etc.).

Request

Consider merging this as an opt-in feature (e.g., via table_format="html" flag) or as the default behavior for complex tables.
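
As a rough sketch of how such a flag could behave, the standalone example below dispatches on a `table_format` argument. Everything in it is an assumption made for illustration: `render_table`, the `(text, colspan)` cell representation, and the "auto" mode (HTML only when a merged cell is present) are hypothetical and are not part of markitdown.

```python
from html import escape


def render_table(rows, table_format="markdown"):
    """Render rows of (text, colspan) cells as Markdown or HTML (hypothetical API)."""
    if table_format == "auto":
        # Fall back to HTML only when a merged cell is present.
        has_merge = any(span > 1 for row in rows for _, span in row)
        table_format = "html" if has_merge else "markdown"

    if table_format == "html":
        out = ["<table>"]
        for row in rows:
            cells = "".join(
                f'<td colspan="{span}">{escape(text)}</td>'
                if span > 1
                else f"<td>{escape(text)}</td>"
                for text, span in row
            )
            out.append(f"<tr>{cells}</tr>")
        out.append("</table>")
        return "\n".join(out)

    if table_format == "markdown":
        width = max(sum(span for _, span in row) for row in rows)
        lines = []
        for row in rows:
            cells = []
            for text, span in row:
                cells.append(text)
                cells.extend([""] * (span - 1))  # pad merged cells with blanks
            lines.append("| " + " | ".join(cells) + " |")
        lines.insert(1, "|" + " --- |" * width)
        return "\n".join(lines)

    raise ValueError(f"unknown table_format: {table_format!r}")


rows = [
    [("Name", 2), ("Age", 1)],
    [("Ada", 1), ("Lovelace", 1), ("36", 1)],
]
print(render_table(rows, table_format="markdown"))
print(render_table(rows, table_format="html"))
```

With the default "markdown" setting, existing output stays unchanged and merged cells degrade to blank padding. With "html" or "auto", the span is kept as a colspan attribute.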


Why This Matters

  • Many users need DOCX tables to render correctly in Markdown viewers.
  • HTML tables are the only reliable way to express merged cells in Markdown.

tosmart01, Apr 25 '25 07:04

This feels really necessary.

flyrun9527, Apr 27 '25 09:04

Markdown will lose some table information.

Basil1991, Jul 22 '25 07:07