Table headers of docx appear in body when using Markdownify>=1.0.0

Open Sillocan opened this issue 10 months ago • 2 comments

When I convert a .docx file containing a table with a header row, the header row is now rendered as an ordinary body row, and an empty header row is emitted above it. The exact package versions are listed in the script content below.

Actual behavior:

$ uv run issue-recreation.py
$ cat Test.md
|  |  |  |
| --- | --- | --- |
| Product | Quantity | Price |
| Apple | 10 | $1.00 |
| Banana | 5 | $0.50 |
| Cherry | 20 | $0.20 |

Expected behavior: With markdownify==0.14.0 pinned, the first row is rendered as the table header:

$ uv run --with=markdownify==0.14.0 issue-recreation.py
$ cat Test.md
| Product | Quantity | Price |
| --- | --- | --- |
| Apple | 10 | $1.00 |
| Banana | 5 | $0.50 |
| Cherry | 20 | $0.20 |
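Until the root cause is identified, one possible stopgap is to post-process the generated Markdown. The sketch below is a hypothetical helper (`promote_first_row_to_header` is not part of markitdown or markdownify) that takes the text of a single pipe table and, if its header row is empty, drops that row and promotes the first body row. It assumes the text contains exactly one table and nothing else.

```python
def promote_first_row_to_header(markdown_table: str) -> str:
    """Promote the first body row of a pipe table when its header row is empty.

    Hypothetical workaround helper; expects the text of one table only.
    """
    lines = markdown_table.strip().splitlines()
    if len(lines) < 3:
        # Need at least: header, separator, one body row
        return markdown_table

    # Split the header row into cells and check whether all of them are blank
    header_cells = [cell.strip() for cell in lines[0].strip().strip("|").split("|")]
    if any(header_cells):
        # The header already has content, so leave the table untouched
        return markdown_table

    separator, first_body_row, *remaining_rows = lines[1:]
    return "\n".join([first_body_row, separator, *remaining_rows]) + "\n"
```

Applied to the "Actual behavior" output above, this would yield the table shown under "Expected behavior". It does not address why the header is lost in the first place.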

Scripts

Here are the scripts I used to recreate the issue:

make-docx.py

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "python-docx",
# ]
# ///
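from docx import Document

# Sample data; the first tuple is meant to be the header row
products = [
    ('Product', 'Quantity', 'Price'),
    ('Apple', '10', '$1.00'),
    ('Banana', '5', '$0.50'),
    ('Cherry', '20', '$0.20')
]

document = Document()
# Size the table from the data rather than hard-coding 4 rows by 3 columns
table = document.add_table(rows=len(products), cols=len(products[0]))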
# Populate the table with data
for row_idx, product in enumerate(products):
    for col_idx, item in enumerate(product):
        table.rows[row_idx].cells[col_idx].text = item

document.save('Test.docx')
print("Document 'Test.docx' has been created successfully.")

issue-recreation.py

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "annotated-types==0.7.0",
#     "anyio==4.8.0",
#     "audioop-lts==0.2.1 ; python_full_version >= '3.13'",
#     "azure-ai-documentintelligence==1.0.0",
#     "azure-core==1.32.0",
#     "azure-identity==1.20.0",
#     "beautifulsoup4==4.13.3",
#     "certifi==2025.1.31",
#     "cffi==1.17.1 ; platform_python_implementation != 'PyPy'",
#     "charset-normalizer==3.4.1",
#     "cobble==0.1.4",
#     "colorama==0.4.6 ; sys_platform == 'win32'",
#     "cryptography==44.0.2",
#     "defusedxml==0.7.1",
#     "distro==1.9.0",
#     "et-xmlfile==2.0.0",
#     "h11==0.14.0",
#     "httpcore==1.0.7",
#     "httpx==0.28.1",
#     "idna==3.10",
#     "isodate==0.7.2",
#     "jiter==0.8.2",
#     "lxml==5.3.1",
#     "mammoth==1.9.0",
#     "markdownify==1.0.0",
#     "markitdown==0.0.1a5",
#     "msal==1.31.1",
#     "msal-extensions==1.2.0",
#     "numpy==2.2.3",
#     "olefile==0.47",
#     "openai==1.65.2",
#     "openpyxl==3.1.5",
#     "pandas==2.2.3",
#     "pathvalidate==3.2.3",
#     "pdfminer-six==20240706",
#     "pillow==11.1.0",
#     "portalocker==2.10.1",
#     "puremagic==1.28",
#     "pycparser==2.22 ; platform_python_implementation != 'PyPy'",
#     "pydantic==2.10.6",
#     "pydantic-core==2.27.2",
#     "pydub==0.25.1",
#     "pyjwt==2.10.1",
#     "python-dateutil==2.9.0.post0",
#     "python-pptx==1.0.2",
#     "pytz==2025.1",
#     "pywin32==308 ; sys_platform == 'win32'",
#     "requests==2.32.3",
#     "six==1.17.0",
#     "sniffio==1.3.1",
#     "soupsieve==2.6",
#     "speechrecognition==3.14.1",
#     "standard-aifc==3.13.0 ; python_full_version >= '3.13'",
#     "standard-chunk==3.13.0 ; python_full_version >= '3.13'",
#     "tqdm==4.67.1",
#     "typing-extensions==4.12.2",
#     "tzdata==2025.1",
#     "urllib3==2.3.0",
#     "xlrd==2.0.1",
#     "xlsxwriter==3.2.2",
#     "youtube-transcript-api==0.6.3",
# ]
# ///
from pathlib import Path
from markitdown import MarkItDown

md = MarkItDown()
Path("Test.md").write_text(md.convert("Test.docx").text_content)

Sillocan, Mar 04 '25 01:03

Thanks for the report. The table header is written with <th> tags rather than <td> tags, and I wonder if that's what's breaking. I will see if I can debug this later in the week! Might be a problem upstream!
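One way to test the `<th>` versus `<td>` hypothesis is to look at which cell tags the first row of the intermediate HTML actually uses. The following is a minimal sketch using only the standard library; `FirstRowCellTags` and `first_row_cell_tags` are hypothetical names, and the HTML strings are hand-written samples, not real converter output. In practice you would feed it the HTML produced from Test.docx (for example via `mammoth.convert_to_html`, assuming the DOCX path goes through mammoth, which is listed in the dependencies).

```python
from html.parser import HTMLParser


class FirstRowCellTags(HTMLParser):
    """Record the cell tag names (th or td) found in the first <tr>."""

    def __init__(self):
        super().__init__()
        self.rows_seen = 0
        self.first_row_tags = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows_seen += 1
        elif tag in ("th", "td") and self.rows_seen == 1:
            # Only collect cells that belong to the first row
            self.first_row_tags.append(tag)


def first_row_cell_tags(html: str) -> list[str]:
    """Return the cell tags of the first table row in the given HTML."""
    parser = FirstRowCellTags()
    parser.feed(html)
    return parser.first_row_tags


# Hand-written sample input (an assumption, not captured from the converter)
sample_html = (
    "<table>"
    "<tr><td>Product</td><td>Quantity</td><td>Price</td></tr>"
    "<tr><td>Apple</td><td>10</td><td>$1.00</td></tr>"
    "</table>"
)
print(first_row_cell_tags(sample_html))
```

If the first row turns out to consist only of `<td>` cells, the HTML itself carries no header information, and the difference between versions would come down to how markdownify treats a table without a header row.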

afourney, Mar 06 '25 07:03

Any updates on this issue?

jnakhle-sabis, Apr 04 '25 07:04