
Complex Table Conversion Issue (Wrong order, key-value regions)

Open DAVIDCRUZ0202 opened this issue 1 year ago • 4 comments

Bug

  • Out-of-order conversion: it would be nice if the headers (UNCLASSIFIED, American Football Conference (AFC), AFC East) appeared after the text fields.
  • Keys missing their values: the text fields have values, but in the text output everything is separated by newlines. It would be great if each text field were kept together with its value (something like Code: DLPHN instead of what the output currently shows).
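As a stopgap for the second point, the newline-separated text output could be post-processed to rejoin keys and values. The sketch below is only a heuristic and not part of docling: the helper name `pair_key_values` is hypothetical, and it assumes every key ends with a colon and its value is the next non-empty line. Apart from Code and DLPHN, which come from the report above, the sample string is made up for illustration.

```python
def pair_key_values(text: str) -> list[tuple[str, str]]:
    """Pair each line ending in ':' with the next non-empty line.

    Heuristic only: assumes keys end with a colon and each value sits on
    the line right after its key.
    """
    # Drop blank lines so a key and its value become neighbours.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    pairs = []
    i = 0
    while i < len(lines):
        line = lines[i]
        has_next = i + 1 < len(lines)
        if line.endswith(":") and has_next and not lines[i + 1].endswith(":"):
            pairs.append((line[:-1].strip(), lines[i + 1]))
            i += 2  # skip the value we just consumed
        else:
            i += 1  # header, free text, or a key without a value
    return pairs


# Illustrative input shaped like the exported text (not the real file).
sample = "Code:\n\nDLPHN\n\nAFC East\n"
for key, value in pair_key_values(sample):
    print(f"{key}: {value}")
```

A key that is directly followed by another key is skipped rather than guessed at, so empty form fields do not steal the value of the next field.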

Steps to reproduce

from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption

source = "input/complex table.pdf"

pipeline_options = PdfPipelineOptions(do_table_structure=True)
pipeline_options.table_structure_options.do_cell_matching = True  # match predicted table cells back to the PDF's text cells
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = doc_converter.convert(source)

# Export results
output_dir = Path("resultconversion")
output_dir.mkdir(parents=True, exist_ok=True)
doc_filename = result.input.file.stem

# Export Text format:
with (output_dir / f"{doc_filename}.txt").open("w", encoding="utf-8") as fp:
    fp.write(result.document.export_to_text())
Screenshot 2024-11-07 at 11 41 20 AM

complex input table.pdf

Screenshot 2024-11-07 at 11 43 20 AM

complex output table.txt

Docling version

2.4.0

Python version

3.12.4

DAVIDCRUZ0202 avatar Nov 07 '24 16:11 DAVIDCRUZ0202

Thank you @DAVIDCRUZ0202 !

We will need to update how we detect key-value regions.

PeterStaar-IBM avatar Nov 08 '24 05:11 PeterStaar-IBM

To add on, I think there might also be a bug: docling does not label this as a key-value pair, but instead labels "Code:" as text and "DLPHN" as text. Is this the expected behavior? If so, is there any way to extract key-value pairs like we do for tables, headers, etc.?

mawil21 avatar Nov 12 '24 06:11 mawil21

I think there might also be a bug in how docling is not able to label this as a key-value pair. But rather, "Code:" as text and "DLPHN" as text. Is this the expected behavior?

Hi @mawil21, while the layout model in docling can identify key-value regions, these regions are currently treated as plain text by downstream processing. The ability to preserve the structure within key-value regions (and forms) is on our roadmap. You can find more details here.
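To see this behavior for yourself, it may help to group the converted items by their layout label. The sketch below is an illustration under assumptions, not docling's own code: the helper `texts_by_label` is hypothetical, and it only requires an iterable of (item, level) pairs whose items carry a `label` and optionally a `text` attribute. The stand-in items mimic the case described in this thread.

```python
from collections import defaultdict
from types import SimpleNamespace


def texts_by_label(items):
    """Group item texts by layout label.

    `items` is any iterable of (item, level) pairs; each item needs a
    `label` attribute and may have a `text` attribute.
    """
    grouped = defaultdict(list)
    for item, _level in items:
        # Accept either an enum member (use its value) or a plain string.
        label = getattr(item.label, "value", item.label)
        text = getattr(item, "text", None)
        if text:
            grouped[str(label)].append(text)
    return dict(grouped)


# Stand-ins for converted items: both halves of the pair arrive as text.
sample = [
    (SimpleNamespace(label="text", text="Code:"), 1),
    (SimpleNamespace(label="text", text="DLPHN"), 1),
]
print(texts_by_label(sample))
```

With docling installed, one could try passing `result.document.iterate_items()` instead of `sample`; this assumes that method yields (item, level) pairs, as it does in recent docling-core releases, so check it against the version you have.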

sh-gupta avatar Nov 12 '24 10:11 sh-gupta

Note: The current version of docling (2.17.0) orders the headers correctly, but we are still working on proper key-value placement.

Output:

Image

cau-git avatar Jan 30 '25 14:01 cau-git