docling
PDF tables not parsed as markdown tables
Bug
A large table in a PDF file is not parsed as a markdown table; it comes out as a sequence of plain lines instead. In addition, some cells end up out of their original position.
File: Auto_history_05122024.pdf
Steps to reproduce
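from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption

IMAGE_RESOLUTION_SCALE = 2.0

# Render page, table, and picture images at 2x resolution.
pipeline_options = PdfPipelineOptions()
pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
pipeline_options.generate_page_images = True
pipeline_options.generate_table_images = True
pipeline_options.generate_picture_images = True

# Enable table structure recognition with cell matching, in ACCURATE mode.
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

doc_converter = DocumentConverter(
    allowed_formats=[
        InputFormat.PDF,
    ],
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=PyPdfiumDocumentBackend,  # pypdfium2-based PDF backend
        ),
    },
)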
Docling version
docling>=2.12.0
Python version
3.11
@kurtgdl This is a known problem when parsing "fat" tables (tables with a lot of text in the cells). A new model is currently in training that should solve it.
Stay tuned!
@kurtgdl Please try with Docling v2.26.0. We updated the table model with new weights, which should address your issue.
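To check whether an upgrade actually changes the result, one option is to count the pipe tables in the exported markdown before and after. The helper below is only a rough sketch: `count_markdown_tables` is a hypothetical name, not part of Docling, and it assumes tables are exported in GitHub-style pipe syntax with at least two columns.

```python
import re

# Matches a GitHub-style table delimiter row with two or more columns,
# e.g. "|---|---|" or "| :--- | ---: |". Single-column tables are not detected.
_DELIMITER_ROW = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)+\|?\s*$")


def count_markdown_tables(markdown: str) -> int:
    """Count pipe tables: a delimiter row directly below a line containing '|'."""
    lines = markdown.splitlines()
    count = 0
    for header, candidate in zip(lines, lines[1:]):
        if "|" in header and _DELIMITER_ROW.match(candidate):
            count += 1
    return count


# Possible usage with the converter from the issue (needs docling and the PDF):
#   result = doc_converter.convert("Auto_history_05122024.pdf")
#   markdown = result.document.export_to_markdown()
#   print(count_markdown_tables(markdown))
```

A count of zero for a document that visibly contains a table would match the behavior reported above, where the table is flattened into plain lines.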