Newer versions fail to include PDF table cells that older versions handle correctly
Bug
Docling versions 2.13.0 and later fail to include PDF table text cells that version 2.12.0 captures correctly.
Steps to reproduce
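from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption

# Assumed value; the report does not state the scale that was actually used.
IMAGE_RESOLUTION_SCALE = 2.0

# Enable image generation and table structure recognition in ACCURATE mode.
pipeline_options = PdfPipelineOptions()
pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
pipeline_options.generate_page_images = True
pipeline_options.generate_table_images = True
pipeline_options.generate_picture_images = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

# Restrict the converter to PDF input and use the pypdfium2 backend.
doc_converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF],
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=PyPdfiumDocumentBackend,
        ),
    },
)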
Sorry, I cannot provide the PDF file I used because it is an internal one, and I was unable to create a substitute that reproduces the problem. The relevant data objects are included below.
Docling version
docling>=2.13.0
Python version
3.11, 3.10
Naive ways I tried
The issue does not come from the ML model (ds4sd/docling-models), because both versions share it. It appears to come from the code change in layout_model.py, which now calls the new utility class in layout_postprocessor.py. Here is the related commit: https://github.com/DS4SD/docling/commit/60dc852f16dc1adbb5e9284c81a146043a301ec1.
Changing the scores in CONFIDENCE_THRESHOLDS of layout_postprocessor.py back to the values used in version 2.12.0 has no effect.
The code flow: layout_model.py -> standard_pdf_pipeline (TableStructureModel) -> layout_predictor.py -> table_structure_model.py (line 200). This produces the table_cluster object.
Here is part of the object:
- Texts that were captured in export_to_markdown():
Cluster(id=63, label=<DocItemLabel.TEXT: 'text'>, bbox=BoundingBox(l=91.57151794433594, t=156.7459716796875, r=298.26287841796875, b=162.767578125, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), confidence=1.0, cells=[Cell(id=43, text='ABC', bbox=BoundingBox(l=91.57151794433594, t=156.7459716796875, r=298.26287841796875, b=162.767578125, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>))], children=[]),
Cluster(id=19, label=<DocItemLabel.TEXT: 'text'>, bbox=BoundingBox(l=91.27823638916016, t=165.1458740234375, r=327.4078674316406, b=187.966552734375, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), confidence=0.6519731283187866, cells=[Cell(id=44, text='EFG', bbox=BoundingBox(l=91.30944061279297, t=165.1458740234375, r=327.4078674316406, b=171.16741943359375, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>))
- Texts that failed to be captured:
Cluster(id=33, label=<DocItemLabel.TEXT: 'text'>, bbox=BoundingBox(l=91.27202606201172, t=466.488525390625, r=319.7994384765625, b=489.31439208984375, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), confidence=0.5935887098312378, cells=[Cell(id=69, text='111', bbox=BoundingBox(l=91.27202606201172, t=466.488525390625, r=186.20797729492188, b=472.4788818359375, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>)), Cell(id=70, text='222 ', bbox=BoundingBox(l=91.80242919921875, t=483.28656005859375, r=319.7994384765625, b=489.31439208984375, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>))], children=[]),
Cluster(id=52, label=<DocItemLabel.TEXT: 'text'>, bbox=BoundingBox(l=91.51538848876953, t=491.68560791015625, r=323.0069274902344, b=497.71343994140625, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), confidence=0.4870221018791199, cells=[Cell(id=71, text='333 ', bbox=BoundingBox(l=91.51538848876953, t=491.68560791015625, r=323.0069274902344, b=497.71343994140625, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>))], children=[]),
Cluster(id=24, label=<DocItemLabel.TEXT: 'text'>, bbox=BoundingBox(l=91.1409912109375, t=500.05340576171875, r=324.97491455078125, b=539.72412109375, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), confidence=0.6086541414260864, cells=[Cell(id=72, text='o4444', bbox=BoundingBox(l=91.27826690673828, t=500.05340576171875, r=305.6815490722656, b=506.11248779296875, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>)), Cell(id=73, text='555 ', bbox=BoundingBox(l=91.27202606201172, t=508.48370361328125, r=324.43212890625, b=514.5115356445312, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>)), Cell(id=74, text='t555', bbox=BoundingBox(l=91.1409912109375, t=516.8515014648438, r=281.83172607421875, b=522.8731079101562, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>)), Cell(id=75, text='6666', bbox=BoundingBox(l=91.45295715332031, t=525.2659912109375, r=324.80145263671875, b=531.28759765625, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>))]
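For anyone who wants to run the same check on their own document, here is a minimal sketch of how the captured and missing cells can be told apart. It does not use docling's own classes: `Cell` and `Cluster` below are hypothetical stand-ins that keep only the fields needed, and `find_missing_cells` is an illustrative helper, not part of the library. It simply checks whether each cell's text appears in the Markdown export.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    # Hypothetical stand-in for a layout cell; only id and text are kept.
    id: int
    text: str


@dataclass
class Cluster:
    # Hypothetical stand-in for a layout cluster.
    id: int
    confidence: float
    cells: list[Cell] = field(default_factory=list)


def find_missing_cells(clusters, markdown):
    """Return (cluster_id, cell_id, text) for each cell whose text is absent from markdown."""
    missing = []
    for cluster in clusters:
        for cell in cluster.cells:
            text = cell.text.strip()
            # A plain substring test; it can miss text that the export reflowed.
            if text and text not in markdown:
                missing.append((cluster.id, cell.id, text))
    return missing


# Two clusters taken from the dump above, reduced to id, confidence and text.
clusters = [
    Cluster(id=63, confidence=1.0, cells=[Cell(id=43, text="ABC")]),
    Cluster(id=52, confidence=0.4870221018791199, cells=[Cell(id=71, text="333 ")]),
]
markdown = "| ABC | EFG |"  # placeholder for the real export_to_markdown() output
print(find_missing_cells(clusters, markdown))
```

With real data, `markdown` would be the string returned by export_to_markdown(), and the clusters would be copied from the table_cluster object.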
@kurtgdl can you please provide an example input that reproduces this problem? Many thanks.
Hi @cau-git. Sorry, I cannot provide a sample because the file I used is private. I have been trying to create a similar file that reproduces the issue, but so far without success.
Hi @cau-git. I found that if I use the fast mode instead of the accurate one,
pipeline_options.table_structure_options.mode = TableFormerMode.FAST
the previously missing cell now appears. However, I'm worried that the fast mode may not capture all the cells properly in other files.
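One way to check whether the two modes disagree on other files is to compare the table cells in their Markdown exports. The sketch below assumes the same PDF has been converted once per mode and that both exports are available as strings. `table_cell_texts` and `cells_only_in` are hypothetical helper names, and the parsing is deliberately naive (it does not handle escaped pipes inside cells).

```python
def table_cell_texts(markdown: str) -> set[str]:
    """Collect the non-empty cell texts from pipe-delimited Markdown table rows."""
    texts = set()
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # not a table row
        for cell in line.strip("|").split("|"):
            cell = cell.strip()
            # Skip empty cells and separator cells such as "---" or ":---:".
            if cell and set(cell) - set("-:"):
                texts.add(cell)
    return texts


def cells_only_in(reference_md: str, candidate_md: str) -> set[str]:
    """Return cell texts found in reference_md but absent from candidate_md."""
    return table_cell_texts(reference_md) - table_cell_texts(candidate_md)


# Dummy exports standing in for the FAST and ACCURATE results.
md_fast = "| A | B |\n|---|---|\n| 111 | 222 |\n| 333 | 444 |"
md_accurate = "| A | B |\n|---|---|\n| 111 | 222 |"
print(sorted(cells_only_in(md_fast, md_accurate)))
```

Swapping the arguments shows the cells that only the accurate mode captures, which helps judge whether switching to the fast mode loses anything on a given file.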
I am seeing something similar but cannot share files either. The OCR recognizes all cells correctly, but the table parser fails to locate them.
@kurtgdl But can you fake the data and provide a sample? I mean, you can discard the document's real content and replace it with dummy data.