PDF tables not parsed as markdown tables

Open kurtgdl opened this issue 10 months ago • 1 comments

Bug

A large table in a PDF file is not converted to a Markdown table; instead, its contents are emitted as sequential lines of text. In addition, some cells end up out of their original position.

File: Auto_history_05122024.pdf

Steps to reproduce

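from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption

IMAGE_RESOLUTION_SCALE = 2.0

# Render page, table, and picture images at 2x resolution.
pipeline_options = PdfPipelineOptions()
pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
pipeline_options.generate_page_images = True
pipeline_options.generate_table_images = True
pipeline_options.generate_picture_images = True

# Enable table structure recognition with cell matching, in ACCURATE mode.
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

doc_converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF],
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=PyPdfiumDocumentBackend,  # pypdfium2 backend selected explicitly
        ),
    },
)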
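
One way to check whether a conversion result contains a Markdown table at all is to scan the exported text for a pipe-table header followed by a delimiter row. The sketch below is only an illustration: `count_markdown_tables` is a hypothetical helper, not part of docling, and it assumes the Markdown string was obtained from the converter above, for example via `doc_converter.convert(path).document.export_to_markdown()`.

```python
import re

# A pipe-table delimiter row such as "|---|:---:|" or "| --- | --- |".
_DELIM_ROW = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def count_markdown_tables(markdown: str) -> int:
    """Count pipe tables: a row containing '|' directly followed by a delimiter row."""
    lines = markdown.splitlines()
    count = 0
    for prev, cur in zip(lines, lines[1:]):
        # Requiring '|' on both lines keeps a plain '---' horizontal rule from matching.
        if "|" in prev and "|" in cur and _DELIM_ROW.match(cur):
            count += 1
    return count
```

A count of 0 for a document that is known to contain a table would be consistent with the behavior reported here, where the table content is exported as sequential lines.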

Docling version

docling>=2.12.0

Python version

3.11

kurtgdl, Feb 11 '25 05:02

@kurtgdl This is a known problem when parsing "fat" tables (tables with a lot of text in their cells). A new model is currently in training that should solve it!

Stay tuned!

PeterStaar-IBM, Feb 11 '25 05:02

@kurtgdl Please try again with Docling v2.26.0. We updated the table model with new weights, which should address your issue.

maxmnemonic, Mar 17 '25 09:03
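
Before re-running the reproduction, it may help to confirm that the installed docling is at least the version mentioned in the thread (2.26.0). The following is a minimal sketch using only the standard library; `parse_release`, `is_at_least`, and `docling_is_new_enough` are hypothetical helper names, and the parsing deliberately ignores pre-release and dev suffixes.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(ver: str) -> tuple[int, ...]:
    """Parse the leading numeric release, e.g. '2.26.0' -> (2, 26, 0)."""
    parts = []
    for piece in ver.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric segment such as 'dev1'
        parts.append(int(digits))
    return tuple(parts)


def is_at_least(installed: str, minimum: str) -> bool:
    """Compare versions numerically, padding the shorter one with zeros."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def docling_is_new_enough(minimum: str = "2.26.0") -> bool:
    """Return False if docling is not installed or is older than `minimum`."""
    try:
        return is_at_least(version("docling"), minimum)
    except PackageNotFoundError:
        return False
```

Comparing integer tuples rather than raw strings matters here: as strings, "2.100.0" would sort before "2.26.0".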