docling: newer versions fail to include PDF table cells that older versions handle successfully

Bug

Docling versions 2.13.0 and later fail to include PDF table text cells that version 2.12.0 captures correctly.

Steps to reproduce

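from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption

# Assumed value; the original report does not state it.
IMAGE_RESOLUTION_SCALE = 2.0

# Enable image generation and table structure recognition with cell matching.
pipeline_options = PdfPipelineOptions()
pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
pipeline_options.generate_page_images = True
pipeline_options.generate_table_images = True
pipeline_options.generate_picture_images = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

# Restrict the converter to PDF input and use the pypdfium2 backend.
doc_converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF],
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=PyPdfiumDocumentBackend,
        ),
    },
)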

Sorry, I cannot provide the PDF file I used because it is an internal one, and I was unable to create a substitute that reproduces the issue. I include the relevant data objects below.

Docling version

docling>=2.13.0

Python version

3.11, 3.10

Naive ways I tried

The issue does not come from the ML model (ds4sd/docling-models), because both versions share it. It comes from the changes in layout_model.py, which now calls the new utility class in layout_postprocessor.py. Here is the related commit: https://github.com/DS4SD/docling/commit/60dc852f16dc1adbb5e9284c81a146043a301ec1.

Changing the scores in CONFIDENCE_THRESHOLDS of layout_postprocessor.py back to the values used in version 2.12.0 has no effect.

The code flow: layout_model.py -> standard_pdf_pipeline (TableStructureModel) -> layout_predictor.py -> table_structure_model.py (line 200). This produces the table_cluster object. Here is part of the object:

Texts that were captured in export_to_markdown():

Cluster(id=63, label=<DocItemLabel.TEXT: 'text'>, bbox=BoundingBox(l=91.57151794433594, t=156.7459716796875, r=298.26287841796875, b=162.767578125, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), confidence=1.0, cells=[Cell(id=43, text='ABC', bbox=BoundingBox(l=91.57151794433594, t=156.7459716796875, r=298.26287841796875, b=162.767578125, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>))], children=[]),
Cluster(id=19, label=<DocItemLabel.TEXT: 'text'>, bbox=BoundingBox(l=91.27823638916016, t=165.1458740234375, r=327.4078674316406, b=187.966552734375, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), confidence=0.6519731283187866, cells=[Cell(id=44, text='EFG', bbox=BoundingBox(l=91.30944061279297, t=165.1458740234375, r=327.4078674316406, b=171.16741943359375, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>))

Texts that failed to be captured:

Cluster(id=33, label=<DocItemLabel.TEXT: 'text'>, bbox=BoundingBox(l=91.27202606201172, t=466.488525390625, r=319.7994384765625, b=489.31439208984375, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), confidence=0.5935887098312378, cells=[Cell(id=69, text='111', bbox=BoundingBox(l=91.27202606201172, t=466.488525390625, r=186.20797729492188, b=472.4788818359375, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>)), Cell(id=70, text='222 ', bbox=BoundingBox(l=91.80242919921875, t=483.28656005859375, r=319.7994384765625, b=489.31439208984375, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>))], children=[]),
Cluster(id=52, label=<DocItemLabel.TEXT: 'text'>, bbox=BoundingBox(l=91.51538848876953, t=491.68560791015625, r=323.0069274902344, b=497.71343994140625, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), confidence=0.4870221018791199, cells=[Cell(id=71, text='333 ', bbox=BoundingBox(l=91.51538848876953, t=491.68560791015625, r=323.0069274902344, b=497.71343994140625, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>))], children=[]),
Cluster(id=24, label=<DocItemLabel.TEXT: 'text'>, bbox=BoundingBox(l=91.1409912109375, t=500.05340576171875, r=324.97491455078125, b=539.72412109375, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), confidence=0.6086541414260864, cells=[Cell(id=72, text='o4444', bbox=BoundingBox(l=91.27826690673828, t=500.05340576171875, r=305.6815490722656, b=506.11248779296875, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>)), Cell(id=73, text='555 ', bbox=BoundingBox(l=91.27202606201172, t=508.48370361328125, r=324.43212890625, b=514.5115356445312, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>)), Cell(id=74, text='t555', bbox=BoundingBox(l=91.1409912109375, t=516.8515014648438, r=281.83172607421875, b=522.8731079101562, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>)), Cell(id=75, text='6666', bbox=BoundingBox(l=91.45295715332031, t=525.2659912109375, r=324.80145263671875, b=531.28759765625, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>))]
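
To compare the captured and missing clusters side by side, the dumps above can be reduced to a small table of cluster id, confidence, and cell texts. The following is a minimal sketch using only the standard library. The helper name summarize_clusters is hypothetical and not part of docling, the sample string is an abbreviated excerpt of the dump (bounding boxes omitted), and the regular expressions assume the repr uses straight quotes, as Python emits them.

```python
import re


def summarize_clusters(dump: str) -> list[tuple[int, float, list[str]]]:
    """Return (cluster id, confidence, cell texts) for each Cluster(...) in a repr dump."""
    rows = []
    # Split just before each "Cluster(id=" so that every chunk holds one cluster.
    for chunk in re.split(r"(?=Cluster\(id=)", dump):
        head = re.match(r"Cluster\(id=(\d+)", chunk)
        conf = re.search(r"confidence=([0-9.]+)", chunk)
        if not head or not conf:
            continue  # leading text or an incomplete chunk
        # Cell texts appear as text='...'; label values are not preceded by "text=".
        texts = re.findall(r"text='([^']*)'", chunk)
        rows.append((int(head.group(1)), float(conf.group(1)), texts))
    return rows


# Abbreviated excerpt of the dump (bounding boxes left out for brevity).
sample = (
    "Cluster(id=63, label=<DocItemLabel.TEXT: 'text'>, confidence=1.0, "
    "cells=[Cell(id=43, text='ABC')], children=[]), "
    "Cluster(id=33, label=<DocItemLabel.TEXT: 'text'>, confidence=0.5935887098312378, "
    "cells=[Cell(id=69, text='111'), Cell(id=70, text='222 ')], children=[])"
)

for cluster_id, confidence, texts in summarize_clusters(sample):
    print(cluster_id, confidence, texts)
```

This only tabulates what is already in the dump; it does not by itself explain why a cluster is dropped.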

Feb 04 '25 12:02 kurtgdl

@kurtgdl can you please provide an example input that reproduces this problem? Many thanks.

Feb 04 '25 12:02 cau-git

Hi @cau-git. Sorry, I cannot provide a sample because the file I used is private. I have been trying to create a similar file that reproduces the issue, but to no avail.

Feb 04 '25 14:02 kurtgdl

Hi @cau-git. I found that if I use the fast mode instead of the accurate one,

pipeline_options.table_structure_options.mode = TableFormerMode.FAST

the previously missing cell now appears. However, I'm worried that the fast mode may not capture all the cells properly in other files.
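
One way to check which cell texts differ between the two modes is to diff the Markdown exports. The following is a rough sketch using only the standard library. The helper names cell_texts and missing_cells are hypothetical, and the two sample strings are made-up stand-ins for the output of export_to_markdown() in each mode, not real results.

```python
import re


def cell_texts(markdown: str) -> set[str]:
    """Collect stripped text fragments, splitting Markdown table rows on '|'."""
    cells = set()
    for line in markdown.splitlines():
        for part in line.split("|"):
            part = part.strip()
            # Skip empty fragments and table separator rows such as '---' or ':--:'.
            if part and not re.fullmatch(r":?-+:?", part):
                cells.add(part)
    return cells


def missing_cells(reference_md: str, candidate_md: str) -> list[str]:
    """Return fragments present in reference_md but absent from candidate_md."""
    return sorted(cell_texts(reference_md) - cell_texts(candidate_md))


# Made-up stand-ins; in practice these would be the export_to_markdown()
# results obtained with TableFormerMode.FAST and TableFormerMode.ACCURATE.
fast_md = "| Item | Value |\n|------|-------|\n| 111 | 222 |\n| 333 | 444 |"
accurate_md = "| Item | Value |\n|------|-------|\n| 111 | |\n| 333 | 444 |"

print(missing_cells(fast_md, accurate_md))
```

Running the comparison in both directions across several files would show whether either mode drops cells that the other keeps.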

Feb 07 '25 10:02 kurtgdl

I am seeing something similar but cannot share files either. The OCR recognizes all cells correctly, but the table parser fails to locate them.

Apr 14 '25 12:04 amadou-6e

@kurtgdl Can you fake the data and provide a sample? I mean, you can discard the document's real data and add some dummy data instead.

Aug 28 '25 08:08 Erickrus