Newer versions fail to include pdf table cells that are successfully handled in older versions

Open kurtgdl opened this issue 10 months ago • 3 comments

Bug

Docling versions from 2.13.0 onwards fail to include PDF table text cells that version 2.12.0 captures correctly.

Steps to reproduce

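from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption

# The report does not state the value used; 2.0 is an assumed placeholder.
IMAGE_RESOLUTION_SCALE = 2.0

# Enable image generation and table structure recognition in accurate mode.
pipeline_options = PdfPipelineOptions()
pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
pipeline_options.generate_page_images = True
pipeline_options.generate_table_images = True
pipeline_options.generate_picture_images = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

# Restrict the converter to PDF input and use the pypdfium2 backend.
doc_converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF],
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=PyPdfiumDocumentBackend,
        ),
    },
)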

Sorry, I cannot provide the PDF file I used because it is an internal document, and I was unable to construct a substitute that reproduces the problem. I include the relevant data objects below.

Docling version

docling>=2.13.0

Python version

3.11, 3.10

Naive ways I tried

The issue does not come from the ML model (ds4sd/docling-models), because both versions share it. It comes from the code change in layout_model.py, which now calls the new utility class in layout_postprocessor.py. Here is the related commit: https://github.com/DS4SD/docling/commit/60dc852f16dc1adbb5e9284c81a146043a301ec1.

Changing the scores in CONFIDENCE_THRESHOLDS of layout_postprocessor.py back to their original values from version 2.12.0 has no effect.

The code flow: layout_model.py -> standard_pdf_pipeline (TableStructureModel) -> layout_predictor.py -> table_structure_model.py (line 200). This produces the table_cluster object. Here is part of the object:

  • Texts that were captured in export_to_markdown():
Cluster(id=63, label=<DocItemLabel.TEXT: 'text'>, bbox=BoundingBox(l=91.57151794433594, t=156.7459716796875, r=298.26287841796875, b=162.767578125, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), confidence=1.0, cells=[Cell(id=43, text='ABC', bbox=BoundingBox(l=91.57151794433594, t=156.7459716796875, r=298.26287841796875, b=162.767578125, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>))], children=[]),
Cluster(id=19, label=<DocItemLabel.TEXT: 'text'>, bbox=BoundingBox(l=91.27823638916016, t=165.1458740234375, r=327.4078674316406, b=187.966552734375, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), confidence=0.6519731283187866, cells=[Cell(id=44, text='EFG', bbox=BoundingBox(l=91.30944061279297, t=165.1458740234375, r=327.4078674316406, b=171.16741943359375, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>))
  • Texts that failed to be captured:
Cluster(id=33, label=<DocItemLabel.TEXT: 'text'>, bbox=BoundingBox(l=91.27202606201172, t=466.488525390625, r=319.7994384765625, b=489.31439208984375, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), confidence=0.5935887098312378, cells=[Cell(id=69, text='111', bbox=BoundingBox(l=91.27202606201172, t=466.488525390625, r=186.20797729492188, b=472.4788818359375, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>)), Cell(id=70, text='222 ', bbox=BoundingBox(l=91.80242919921875, t=483.28656005859375, r=319.7994384765625, b=489.31439208984375, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>))], children=[]),
Cluster(id=52, label=<DocItemLabel.TEXT: 'text'>, bbox=BoundingBox(l=91.51538848876953, t=491.68560791015625, r=323.0069274902344, b=497.71343994140625, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), confidence=0.4870221018791199, cells=[Cell(id=71, text='333 ', bbox=BoundingBox(l=91.51538848876953, t=491.68560791015625, r=323.0069274902344, b=497.71343994140625, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>))], children=[]),
Cluster(id=24, label=<DocItemLabel.TEXT: 'text'>, bbox=BoundingBox(l=91.1409912109375, t=500.05340576171875, r=324.97491455078125, b=539.72412109375, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), confidence=0.6086541414260864, cells=[Cell(id=72, text='o4444', bbox=BoundingBox(l=91.27826690673828, t=500.05340576171875, r=305.6815490722656, b=506.11248779296875, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>)), Cell(id=73, text='555 ', bbox=BoundingBox(l=91.27202606201172, t=508.48370361328125, r=324.43212890625, b=514.5115356445312, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>)), Cell(id=74, text='t555', bbox=BoundingBox(l=91.1409912109375, t=516.8515014648438, r=281.83172607421875, b=522.8731079101562, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>)), Cell(id=75, text='6666', bbox=BoundingBox(l=91.45295715332031, t=525.2659912109375, r=324.80145263671875, b=531.28759765625, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>))]
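
For anyone triaging a similar dump, here is a minimal sketch of how such a printed cluster list could be checked against the Markdown export. It is not part of docling: the helper names `parse_cluster_dump` and `cells_missing_from` are hypothetical, and the parsing assumes the repr layout shown above, with cell texts that contain no apostrophes and have both quotes present.

```python
import re

# Patterns for the repr layout shown in the dump (an assumption, not a docling API).
CLUSTER_START = re.compile(r"Cluster\(id=(\d+)")
CONFIDENCE = re.compile(r"confidence=([0-9.]+)")
CELL_TEXT = re.compile(r"Cell\(id=(\d+), text='([^']*)'")


def parse_cluster_dump(dump: str) -> list[dict]:
    """Extract cluster id, confidence, and (cell id, text) pairs from a printed cluster list."""
    # Pasted dumps often carry curly quotes; normalize them to plain apostrophes.
    dump = dump.translate({0x2018: "'", 0x2019: "'"})
    starts = [m.start() for m in CLUSTER_START.finditer(dump)]
    clusters = []
    for begin, end in zip(starts, starts[1:] + [len(dump)]):
        chunk = dump[begin:end]
        confidence = CONFIDENCE.search(chunk)
        clusters.append(
            {
                "id": int(CLUSTER_START.match(chunk).group(1)),
                "confidence": float(confidence.group(1)) if confidence else None,
                "cells": [(int(i), text) for i, text in CELL_TEXT.findall(chunk)],
            }
        )
    return clusters


def cells_missing_from(markdown: str, clusters: list[dict]) -> list[tuple]:
    """Return (cluster id, cell id, text) for every non-empty cell text absent from the Markdown."""
    missing = []
    for cluster in clusters:
        for cell_id, text in cluster["cells"]:
            stripped = text.strip()
            if stripped and stripped not in markdown:
                missing.append((cluster["id"], cell_id, text))
    return missing
```

Feeding it the printed table_cluster contents and the string returned by export_to_markdown() lists the dropped cells together with the confidence of the cluster they belong to, which makes it easier to see whether the losses correlate with low-confidence clusters. The substring check is deliberately crude: a short cell text that also occurs elsewhere in the document will be reported as present.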

kurtgdl avatar Feb 04 '25 12:02 kurtgdl

@kurtgdl can you please provide an example input that reproduces this problem? Many thanks.

cau-git avatar Feb 04 '25 12:02 cau-git

Hi @cau-git. Sorry, I cannot provide a sample because the file I used is private. I have been trying to create a similar file that reproduces the issue, but to no avail.

kurtgdl avatar Feb 04 '25 14:02 kurtgdl

Hi @cau-git. I found that if I use the fast mode instead of the accurate one,

pipeline_options.table_structure_options.mode = TableFormerMode.FAST

the previously missing cell now appears. However, I'm worried that the fast mode may not capture all the cells properly in other files.
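
To quantify that worry on other files, the cell texts produced by the two modes can be compared directly. The following is a rough sketch under a few assumptions: `table_cell_texts` and `lost_cells` are hypothetical helper names, the converter setup mirrors the one in the report minus the image options, and the cell texts are read from `document.tables` via `table.data.table_cells`.

```python
from collections import Counter


def table_cell_texts(pdf_path: str, mode_name: str) -> list[str]:
    """Convert pdf_path with the given TableFormer mode ("FAST" or "ACCURATE") and collect table cell texts."""
    # docling is imported here so that lost_cells() below can be used without it installed.
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_table_structure = True
    options.table_structure_options.do_cell_matching = True
    options.table_structure_options.mode = TableFormerMode[mode_name]

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=options,
                backend=PyPdfiumDocumentBackend,
            ),
        }
    )
    document = converter.convert(pdf_path).document
    return [cell.text for table in document.tables for cell in table.data.table_cells]


def lost_cells(reference: list[str], candidate: list[str]) -> list[str]:
    """Texts that occur more often in reference than in candidate (multiset difference, whitespace-stripped)."""
    difference = Counter(t.strip() for t in reference) - Counter(t.strip() for t in candidate)
    return sorted(difference.elements())


# Example usage (the file name is a placeholder):
# fast = table_cell_texts("sample.pdf", "FAST")
# accurate = table_cell_texts("sample.pdf", "ACCURATE")
# print("only in FAST:", lost_cells(fast, accurate))
# print("only in ACCURATE:", lost_cells(accurate, fast))
```

Running the comparison in both directions shows which mode drops which cells for a given file. The same `lost_cells` helper could also compare lists saved from a 2.12.0 environment and a 2.13.0 environment, since it only needs plain lists of strings.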

kurtgdl avatar Feb 07 '25 10:02 kurtgdl

I am seeing something similar but cannot share files either. The OCR recognizes all cells correctly, but the table parser fails to locate them.

amadou-6e avatar Apr 14 '25 12:04 amadou-6e

@kurtgdl Could you fake the data and provide a sample? I mean, you can discard the real content of the document and substitute dummy data.

Erickrus avatar Aug 28 '25 08:08 Erickrus