docling Table representation misaligned between PDF and DOCX

Open vagenas opened this issue 1 year ago • 3 comments

Bug

The table representation is inconsistent between PDF and DOCX input. It is not yet clear which of the two should be aligned to the other, and further formats may be affected too.

Steps to reproduce

The snippet below uses the attached minimal example docs table.pdf and table.docx. The PDF is exported to a dataframe with explicit column headers, while for the DOCX the column headers are in the first normal row.

If the table representation within TableItem were the same for both formats, export_to_dataframe() would return the same result too.

from docling.document_converter import DocumentConverter

def check_table(file_path):
    converter = DocumentConverter()
    doc = converter.convert(file_path).document
    table_item = next(doc.iterate_items())[0]
    print(table_item.export_to_dataframe())

check_table("table.pdf")
# >    Year Revenue Income Employees
# > 0  2014    92.7   12.0   379,592
# > 1  2015    81.7   13.1   377,757
# > 2  2016    79.9   11.8   380,300

check_table("table.docx")
# >       0        1       2          3
# > 0  Year  Revenue  Income  Employees
# > 1  2014     92.7    12.0    379,592
# > 2  2015     81.7    13.1    377,757
# > 3  2016     79.9    11.8    380,300
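Until both backends agree, the DOCX result can be brought in line with the PDF one on the pandas side. The sketch below is only a stopgap and does not touch docling itself: promote_first_row_to_header is a hypothetical helper name, and the hand-built dataframe stands in for the table.docx output shown above. It assumes the first row really holds the column headers, which is true for this example but not for every table.

```python
import pandas as pd


def promote_first_row_to_header(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new dataframe whose first row becomes the column labels.

    Hypothetical helper; assumes row 0 contains the header cells.
    """
    header = df.iloc[0].astype(str).tolist()
    # Drop the header row and renumber the remaining rows from 0.
    body = df.iloc[1:].reset_index(drop=True)
    body.columns = header
    return body


# Stand-in for the dataframe exported from table.docx in the report above.
docx_df = pd.DataFrame(
    [
        ["Year", "Revenue", "Income", "Employees"],
        ["2014", "92.7", "12.0", "379,592"],
        ["2015", "81.7", "13.1", "377,757"],
        ["2016", "79.9", "11.8", "380,300"],
    ]
)

aligned = promote_first_row_to_header(docx_df)
print(aligned)
```

After this step the column labels and row index match the layout of the PDF export, so the two dataframes can be compared directly.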

Docling version

Docling version: 2.6.0
Docling Core version: 2.4.0
Docling IBM Models version: 2.0.4
Docling Parse version: 2.0.4

Python version

Python 3.12.7

Nov 19 '24 21:11 vagenas

@vagenas I think this could be because of header identification (not 💯 sure, but this would be my first guess). I think that the DOCX backend does not do any header identification, while the PDF one does.

Nov 21 '24 10:11 PeterStaar-IBM

Indeed, at the moment col_header is explicitly set to False: https://github.com/DS4SD/docling/blob/eb64f6d368c5a13179b527ef0d755682c63b9b21/docling/backend/msword_backend.py#L481

Nov 21 '24 10:11 vagenas

We could check whether the docx API allows detecting something like the checkboxes (top-left in the figure).

Nov 25 '24 09:11 dolfim-ibm
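
One way to explore the last suggestion, assuming the checkboxes meant are Word's table style options (such as "Header Row"): in WordprocessingML these are stored on the table's w:tblLook element, either as a w:firstRow attribute or, in older files, as bit 0x0020 of the hexadecimal w:val attribute. The sketch below reads that flag from a raw w:tbl fragment using only the standard library. The function name and the sample XML are made up for illustration, and this is not how docling's DOCX backend currently works. Note that the flag only says the table style treats the first row specially, which is a hint rather than proof of a semantic header.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
FIRST_ROW_BIT = 0x0020  # "first row" flag in the legacy w:val bitmask


def table_declares_header_row(tbl_xml: str) -> bool:
    """Return True if the w:tbl fragment flags its first row via w:tblLook.

    Hypothetical helper; expects the XML of a single w:tbl element.
    """
    tbl = ET.fromstring(tbl_xml)
    look = tbl.find(f"{{{W_NS}}}tblPr/{{{W_NS}}}tblLook")
    if look is None:
        return False
    # Newer files carry an explicit boolean attribute.
    first_row = look.get(f"{{{W_NS}}}firstRow")
    if first_row is not None:
        return first_row in ("1", "true", "on")
    # Older files only carry the hexadecimal bitmask.
    val = look.get(f"{{{W_NS}}}val")
    if val is not None:
        return bool(int(val, 16) & FIRST_ROW_BIT)
    return False


# Made-up fragment for illustration, not taken from table.docx.
SAMPLE = (
    f'<w:tbl xmlns:w="{W_NS}"><w:tblPr>'
    '<w:tblLook w:val="04A0" w:firstRow="1" w:firstColumn="1"/>'
    "</w:tblPr></w:tbl>"
)

print(table_declares_header_row(SAMPLE))
```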
