Complex Table Conversion Issue (Wrong order, key-value regions)
Bug
- Out-of-order conversion: it would be nice if the headers (UNCLASSIFIED, American Football Conference (AFC), AFC East) appeared after the text fields.
- Keys missing their values: the text fields have values, but in the text output each value is separated from its key by newlines. It would be amazing if the text fields had their values more closely associated with them (something like Code: DLPHN instead of what the output currently shows).
Steps to reproduce
from pathlib import Path
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption
source = "input/complex table.pdf"
pipeline_options = PdfPipelineOptions(do_table_structure=True)
# Match the predicted table cells back to the text cells found in the PDF
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = doc_converter.convert(source)
# Export results
output_dir = Path("resultconversion")
output_dir.mkdir(parents=True, exist_ok=True)
doc_filename = result.input.file.stem
# Export text format
with (output_dir / f"{doc_filename}.txt").open("w", encoding="utf-8") as fp:
    fp.write(result.document.export_to_text())
Docling version
2.4.0
Python version
3.12.4
Thank you @DAVIDCRUZ0202!
We will need to update how we detect key-value regions.
To add on, I think there might also be a bug in that docling does not label this as a key-value pair, but instead labels "Code:" as text and "DLPHN" as text. Is this the expected behavior? If so, is there any way to extract key-value pairs the way we do for tables, headers, etc.?
Hi @mawil21, while the layout model in docling can identify key-value regions, these regions are currently treated as plain text by downstream processing. The ability to preserve the structure within key-value regions (and forms) is on our roadmap. You can find more details here.
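Until that structure is preserved natively, one possible stopgap is to post-process the exported text. The sketch below is not part of docling: `pair_keys_with_values` is a hypothetical helper. It assumes that a key is any line ending in a colon and that its value is the next non-empty line, which may not hold for every document.

```python
def pair_keys_with_values(text: str) -> str:
    """Join each line ending in ':' with the next non-empty line.

    Heuristic only: assumes keys end with a colon and the value follows
    on the next non-empty line. A key followed directly by another key
    is left unpaired.
    """
    lines = [ln.strip() for ln in text.splitlines()]
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.endswith(":"):
            # Skip blank lines to find the candidate value.
            j = i + 1
            while j < len(lines) and not lines[j]:
                j += 1
            if j < len(lines) and not lines[j].endswith(":"):
                out.append(f"{line} {lines[j]}")
                i = j + 1
                continue
        out.append(line)
        i += 1
    return "\n".join(out)


if __name__ == "__main__":
    # Made-up sample resembling the layout described in this issue.
    sample = "Code:\n\nDLPHN\nAFC East"
    print(pair_keys_with_values(sample))
```

It could be applied to the string returned by `result.document.export_to_text()` before writing the file. Because it only sees flat text, it cannot recover pairs whose key and value were emitted far apart or out of order.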
Note: The current version of docling (2.17.0) has fixed the header ordering, but we are continuing to work on proper key-value placement.
Output: