docling
Table representation misaligned between PDF and DOCX
Bug
The table representation is inconsistent between PDF and DOCX: the same table is represented differently depending on the input format. It is not clear which of the two should be aligned to the other, and further formats may be affected too.
Steps to reproduce
The snippet below uses the attached minimal example docs table.pdf and table.docx. The PDF table is exported to a dataframe with explicit column headers, while for the DOCX the column headers end up in the first regular data row and the dataframe gets default integer column labels.
If the table representation within TableItem were the same for both formats, the output of export_to_dataframe() would be the same too.
from docling.document_converter import DocumentConverter


def check_table(file_path):
    converter = DocumentConverter()
    doc = converter.convert(file_path).document
    # iterate_items() yields (item, level) tuples; the table is the first item
    table_item = next(doc.iterate_items())[0]
    print(table_item.export_to_dataframe())


check_table("table.pdf")
# >    Year  Revenue  Income  Employees
# > 0  2014     92.7    12.0    379,592
# > 1  2015     81.7    13.1    377,757
# > 2  2016     79.9    11.8    380,300

check_table("table.docx")
# >       0        1       2          3
# > 0  Year  Revenue  Income  Employees
# > 1  2014     92.7    12.0    379,592
# > 2  2015     81.7    13.1    377,757
# > 3  2016     79.9    11.8    380,300
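
Until both backends agree, one possible caller-side workaround is to promote the first row of the DOCX dataframe to column headers. This is only a minimal sketch: `promote_first_row` is a hypothetical helper (not part of docling), and `docx_df` is a hand-built stand-in for the frame printed for table.docx above, so that the example runs without docling installed.

```python
import pandas as pd


def promote_first_row(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: use the first data row as the column headers."""
    header = df.iloc[0].astype(str).tolist()
    # Drop the header row from the body and renumber the index from 0.
    body = df.iloc[1:].reset_index(drop=True)
    body.columns = header
    return body


# Stand-in for the DOCX result; values copied from the output in this report.
docx_df = pd.DataFrame(
    [
        ["Year", "Revenue", "Income", "Employees"],
        ["2014", "92.7", "12.0", "379,592"],
        ["2015", "81.7", "13.1", "377,757"],
        ["2016", "79.9", "11.8", "380,300"],
    ]
)

fixed = promote_first_row(docx_df)
print(fixed)
```

This only helps when the caller already knows the first row is a header, which is exactly the information the DOCX backend currently does not provide.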
Docling version
Docling version: 2.6.0
Docling Core version: 2.4.0
Docling IBM Models version: 2.0.4
Docling Parse version: 2.0.4
Python version
Python 3.12.7
@vagenas I think this could be because of header identification (not 💯 sure, but this would be my first guess). I think that the DOCX backend does not do any header identification, while the PDF one does.
Indeed, at the moment col_header is explicitly set to False:
https://github.com/DS4SD/docling/blob/eb64f6d368c5a13179b527ef0d755682c63b9b21/docling/backend/msword_backend.py#L481
We could check whether the DOCX API allows us to detect something like the checkboxes (top-left in the figure).
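
As a rough sketch of what such a check could look like, assuming the checkboxes in question are Word's table style options such as "Header Row": in WordprocessingML that option is stored on the table's `w:tblLook` element (`w:firstRow` attribute, or bit 0x0020 of the older `w:val` bitmask), and a row marked as a repeating header carries `w:tblHeader` in its `w:trPr`. The example below reads these flags from raw table XML using only the standard library. `header_hints` and `SAMPLE` are hypothetical names for illustration, and whether the backend's DOCX library exposes the same information directly has not been verified here.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
OFF_VALUES = ("0", "false", "off")


def header_hints(tbl: ET.Element) -> dict:
    """Hypothetical helper: collect header-row hints from a <w:tbl> element."""
    style_first_row = False
    look = tbl.find("w:tblPr/w:tblLook", NS)
    if look is not None:
        first_row = look.get(f"{{{W}}}firstRow")
        if first_row is not None:
            style_first_row = first_row not in OFF_VALUES
        else:
            # Older files only carry a hex bitmask; 0x0020 means "first row".
            mask = int(look.get(f"{{{W}}}val", "0"), 16)
            style_first_row = bool(mask & 0x0020)

    repeat_header_row = False
    first_tr = tbl.find("w:tr", NS)
    if first_tr is not None:
        flag = first_tr.find("w:trPr/w:tblHeader", NS)
        if flag is not None:
            # An element without w:val counts as switched on.
            repeat_header_row = flag.get(f"{{{W}}}val", "1") not in OFF_VALUES

    return {
        "style_first_row": style_first_row,
        "repeat_header_row": repeat_header_row,
    }


# Hand-written sample table XML, not taken from the attached table.docx.
SAMPLE = f"""
<w:tbl xmlns:w="{W}">
  <w:tblPr><w:tblLook w:val="04A0" w:firstRow="1" w:lastRow="0"/></w:tblPr>
  <w:tr><w:trPr><w:tblHeader/></w:trPr></w:tr>
  <w:tr/>
</w:tbl>
""".strip()

hints = header_hints(ET.fromstring(SAMPLE))
print(hints)
```

Both flags are hints rather than proof: Word tends to switch on the first-row style option by default, even for tables whose first row is ordinary data, so a backend would probably need to combine this with other evidence before setting the header flag.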