
Table representation misaligned between PDF and DOCX

Open vagenas opened this issue 1 year ago • 3 comments

Bug

The table representation appears to be inconsistent between PDF and DOCX. Depending on which of the two is taken as the reference, further formats may be affected too.

Steps to reproduce

The snippet below uses the attached minimal example docs table.pdf and table.docx. The PDF is exported to a dataframe with explicit column headers, while for the DOCX the column headers are in the first normal row.

If the table representation within TableItem were the same for both formats, the output of export_to_dataframe() would be the same too.

from docling.document_converter import DocumentConverter

def check_table(file_path):
    converter = DocumentConverter()
    doc = converter.convert(file_path).document
    table_item = next(doc.iterate_items())[0]
    print(table_item.export_to_dataframe())

check_table("table.pdf")
# >    Year Revenue Income Employees
# > 0  2014    92.7   12.0   379,592
# > 1  2015    81.7   13.1   377,757
# > 2  2016    79.9   11.8   380,300

check_table("table.docx")
# >       0        1       2          3
# > 0  Year  Revenue  Income  Employees
# > 1  2014     92.7    12.0    379,592
# > 2  2015     81.7    13.1    377,757
# > 3  2016     79.9    11.8    380,300
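Until the two backends agree, a caller-side workaround is to promote the first row of the DOCX-derived dataframe to column labels. The following is only a minimal sketch, assuming the first row really does hold the headers. `promote_first_row_to_header` is a hypothetical helper and not part of docling, and the dataframe is built by hand to mirror the DOCX output above.

```python
import pandas as pd


def promote_first_row_to_header(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new dataframe whose column labels are taken from df's first row.

    Hypothetical helper (not a docling API). It assumes the caller already
    knows that the first row contains the column headers.
    """
    if df.empty:
        return df.copy()
    # Drop the first row and renumber the remaining rows from 0.
    promoted = df.iloc[1:].reset_index(drop=True)
    # Use the values of the dropped row as column labels.
    promoted.columns = [str(value) for value in df.iloc[0]]
    return promoted


# Hand-built stand-in for the dataframe exported from table.docx.
docx_like = pd.DataFrame(
    [
        ["Year", "Revenue", "Income", "Employees"],
        ["2014", "92.7", "12.0", "379,592"],
        ["2015", "81.7", "13.1", "377,757"],
        ["2016", "79.9", "11.8", "380,300"],
    ]
)

aligned = promote_first_row_to_header(docx_like)
print(aligned)
```

This only masks the difference on the consumer side; the underlying TableItem still has no column header cells marked for the DOCX input.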

Docling version

Docling version: 2.6.0
Docling Core version: 2.4.0
Docling IBM Models version: 2.0.4
Docling Parse version: 2.0.4

Python version

Python 3.12.7

vagenas avatar Nov 19 '24 21:11 vagenas

@vagenas I think this could be because of header identification (not 💯 sure, but this would be my first guess). I think that the DOCX backend does not do any header identification, while the PDF one does.

PeterStaar-IBM avatar Nov 21 '24 10:11 PeterStaar-IBM

Indeed, at the moment col_header is explicitly set to False: https://github.com/DS4SD/docling/blob/eb64f6d368c5a13179b527ef0d755682c63b9b21/docling/backend/msword_backend.py#L481

vagenas avatar Nov 21 '24 10:11 vagenas

We could check whether the docx API allows detecting something like the checkboxes (top-left in the figure).

image

dolfim-ibm avatar Nov 25 '24 09:11 dolfim-ibm
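One possible way to explore the last suggestion, assuming the checkboxes in the figure are Word's table style options (such as "Header Row"): in WordprocessingML these options are stored on the `w:tblLook` element of each table, either as a `w:firstRow` attribute or as a bit in the legacy `w:val` hex mask. The sketch below reads that flag with the standard library only. `header_row_flags` is a hypothetical helper, not part of docling or python-docx, and the sample XML is hand-written for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
FIRST_ROW_BIT = 0x0020  # "first row" bit in the legacy w:val bitmask


def _w(name: str) -> str:
    """Qualify a local name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{name}"


def header_row_flags(document_xml) -> List[bool]:
    """For each w:tbl, report whether w:tblLook requests first-row formatting.

    Hypothetical helper. Accepts the content of word/document.xml as str or
    bytes. Tables without a w:tblLook element are reported as False.
    """
    root = ET.fromstring(document_xml)
    flags = []
    for tbl in root.iter(_w("tbl")):
        look = tbl.find(f"{_w('tblPr')}/{_w('tblLook')}")
        if look is None:
            flags.append(False)
            continue
        first_row = look.get(_w("firstRow"))
        if first_row is not None:
            # Newer files carry an explicit on/off attribute.
            flags.append(first_row in ("1", "true", "on"))
        else:
            # Older files only carry the hex bitmask in w:val.
            val = look.get(_w("val"))
            flags.append(val is not None and bool(int(val, 16) & FIRST_ROW_BIT))
    return flags


# Hand-written sample: explicit flag, bitmask without the bit, no tblLook.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:tbl><w:tblPr><w:tblLook w:val="04A0" w:firstRow="1"/></w:tblPr></w:tbl>'
    '<w:tbl><w:tblPr><w:tblLook w:val="0480"/></w:tblPr></w:tbl>'
    "<w:tbl><w:tblPr/></w:tbl>"
    "</w:body></w:document>"
)

print(header_row_flags(SAMPLE))  # [True, False, False]
```

For a real file, the XML could be read with `zipfile.ZipFile("table.docx").read("word/document.xml")`. Note that this flag only says that first-row styling is requested, and Word enables it by default for many tables, so it would be a hint for header detection rather than a guarantee.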