Table headers of docx appear in body when using Markdownify>=1.0.0
When a .docx table has a header row, that row is now rendered as an ordinary body row, and an empty header row is emitted above it. The exact package versions are listed in the script below.
Actual behavior:
$ uv run issue-recreation.py
$ cat Test.md
| | | |
| --- | --- | --- |
| Product | Quantity | Price |
| Apple | 10 | $1.00 |
| Banana | 5 | $0.50 |
| Cherry | 20 | $0.20 |
Expected behavior:
With markdownify==0.14.0, the output is as expected:
$ uv run --with=markdownify==0.14.0 issue-recreation.py
$ cat Test.md
| Product | Quantity | Price |
| --- | --- | --- |
| Apple | 10 | $1.00 |
| Banana | 5 | $0.50 |
| Cherry | 20 | $0.20 |
Scripts
Here are the scripts I used to recreate the issue:
make-docx.py
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "python-docx",
# ]
# ///
from docx import Document
# Sample data
products = [
    ('Product', 'Quantity', 'Price'),
    ('Apple', '10', '$1.00'),
    ('Banana', '5', '$0.50'),
    ('Cherry', '20', '$0.20'),
]
document = Document()
table = document.add_table(rows=4, cols=3)
# Populate the table with data
for row_idx, product in enumerate(products):
    for col_idx, item in enumerate(product):
        table.rows[row_idx].cells[col_idx].text = item
document.save('Test.docx')
print("Document 'Test.docx' has been created successfully.")
issue-recreation.py
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "annotated-types==0.7.0",
# "anyio==4.8.0",
# "audioop-lts==0.2.1 ; python_full_version >= '3.13'",
# "azure-ai-documentintelligence==1.0.0",
# "azure-core==1.32.0",
# "azure-identity==1.20.0",
# "beautifulsoup4==4.13.3",
# "certifi==2025.1.31",
# "cffi==1.17.1 ; platform_python_implementation != 'PyPy'",
# "charset-normalizer==3.4.1",
# "cobble==0.1.4",
# "colorama==0.4.6 ; sys_platform == 'win32'",
# "cryptography==44.0.2",
# "defusedxml==0.7.1",
# "distro==1.9.0",
# "et-xmlfile==2.0.0",
# "h11==0.14.0",
# "httpcore==1.0.7",
# "httpx==0.28.1",
# "idna==3.10",
# "isodate==0.7.2",
# "jiter==0.8.2",
# "lxml==5.3.1",
# "mammoth==1.9.0",
# "markdownify==1.0.0",
# "markitdown==0.0.1a5",
# "msal==1.31.1",
# "msal-extensions==1.2.0",
# "numpy==2.2.3",
# "olefile==0.47",
# "openai==1.65.2",
# "openpyxl==3.1.5",
# "pandas==2.2.3",
# "pathvalidate==3.2.3",
# "pdfminer-six==20240706",
# "pillow==11.1.0",
# "portalocker==2.10.1",
# "puremagic==1.28",
# "pycparser==2.22 ; platform_python_implementation != 'PyPy'",
# "pydantic==2.10.6",
# "pydantic-core==2.27.2",
# "pydub==0.25.1",
# "pyjwt==2.10.1",
# "python-dateutil==2.9.0.post0",
# "python-pptx==1.0.2",
# "pytz==2025.1",
# "pywin32==308 ; sys_platform == 'win32'",
# "requests==2.32.3",
# "six==1.17.0",
# "sniffio==1.3.1",
# "soupsieve==2.6",
# "speechrecognition==3.14.1",
# "standard-aifc==3.13.0 ; python_full_version >= '3.13'",
# "standard-chunk==3.13.0 ; python_full_version >= '3.13'",
# "tqdm==4.67.1",
# "typing-extensions==4.12.2",
# "tzdata==2025.1",
# "urllib3==2.3.0",
# "xlrd==2.0.1",
# "xlsxwriter==3.2.2",
# "youtube-transcript-api==0.6.3",
# ]
# ///
from pathlib import Path
from markitdown import MarkItDown
md = MarkItDown()
Path("Test.md").write_text(md.convert("Test.docx").text_content)
Thanks for the report. The table header is written with <th> tags rather than <td> tags, and I wonder if that's what's breaking. I will see if I can debug this later in the week! Might be a problem upstream!
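One way to check the <th> vs <td> guess is to look at the intermediate HTML and see which tag the first table row uses. Below is a rough, standard-library-only sketch. The helper `first_row_cell_tags` and both HTML strings are hypothetical and made up for illustration; they are not captured from the real conversion pipeline, so the actual intermediate HTML would need to be substituted in.

```python
from html.parser import HTMLParser


class FirstRowCellTags(HTMLParser):
    """Collect the tag names of the cells in the first <tr> of the input."""

    def __init__(self):
        super().__init__()
        self.rows_seen = 0
        self.first_row_tags = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows_seen += 1
        elif tag in ("th", "td") and self.rows_seen == 1:
            # Only record cells that belong to the first row.
            self.first_row_tags.append(tag)


def first_row_cell_tags(html: str) -> list[str]:
    """Hypothetical helper: return the cell tags used in the first table row."""
    parser = FirstRowCellTags()
    parser.feed(html)
    return parser.first_row_tags


# Hypothetical samples, not real output from the docx conversion.
with_th = (
    "<table><tr><th>Product</th><th>Price</th></tr>"
    "<tr><td>Apple</td><td>$1.00</td></tr></table>"
)
with_td = (
    "<table><tr><td>Product</td><td>Price</td></tr>"
    "<tr><td>Apple</td><td>$1.00</td></tr></table>"
)

print(first_row_cell_tags(with_th))  # ['th', 'th']
print(first_row_cell_tags(with_td))  # ['td', 'td']
```

If the first row of the real intermediate HTML turns out to use <td>, the <th> theory would not explain the regression, and the change is more likely in how the first row is promoted to a header.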
Any updates on this issue?