docling Complete text in rows

Thank you for the initiative. I am using docling for table extraction, and it returns tables/dataframes as expected. However, in some rows the cell text is incomplete, or the text of a single row is split across multiple lines. Is there any way to fix this?

Nov 04 '24 17:11 pankpy

@pankpy Could you please provide an example to illustrate the behaviour? Thanks.

Nov 05 '24 09:11 cau-git

Thank you. Please find the attached files and the code I am using.

from docling.datamodel.base_models import InputFormat
from docling.document_converter import (
    DocumentConverter,
    PdfFormatOption,
    WordFormatOption,
)
from docling.pipeline.simple_pipeline import SimplePipeline
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False  # Not using scanned documents
pipeline_options.do_table_structure = True

doc_converter = DocumentConverter(
    # all of the below is optional, has internal defaults.
    allowed_formats=[
        InputFormat.PDF,
        InputFormat.IMAGE,
        InputFormat.DOCX,
        InputFormat.HTML,
        InputFormat.PPTX,
    ],  # whitelist formats, non-matching files are ignored.
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,  # pipeline options go here.
            backend=PyPdfiumDocumentBackend,  # optional: pick an alternative backend
        ),
        InputFormat.DOCX: WordFormatOption(
            pipeline_cls=SimplePipeline  # default for office formats and HTML
        ),
    },
)

###############

# Raw string so the backslashes in the Windows path are not treated as escapes.
ConversionResult = doc_converter.convert(r"E:\zPankaj\Sample.pdf")  # previously convert_single
print(ConversionResult.document.export_to_markdown())

print('VERIFY RESULT', ConversionResult.document)
print('RESULT TYPE', type(ConversionResult.document))

# Export every detected table to a DataFrame and save it as an Excel file.
for i, table in enumerate(ConversionResult.document.tables):
    df = table.export_to_dataframe()
    print(df)
    df.to_excel(f'Output Sample_S df_{i}.xlsx')

Attachments: Sample.pdf, Output Sample_S df_0.xlsx, Output Sample_S df_1.xlsx, Pycharm_prints
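If the "text in multiple lines" symptom shows up as line breaks embedded inside individual DataFrame cells, one possible post-processing workaround is to collapse the whitespace after `table.export_to_dataframe()`. This is only a sketch under that assumption: `collapse_cell_whitespace` is a hypothetical helper, not part of docling, and it will not recover text that the table model dropped or placed in a separate row.

```python
import pandas as pd


def collapse_cell_whitespace(df):
    """Return a copy of df where whitespace runs (including newlines)
    inside string cells are replaced by a single space.

    Hypothetical helper for illustration; not a docling API.
    """
    def clean(value):
        # Only touch strings; numbers, None, NaN etc. pass through unchanged.
        if isinstance(value, str):
            return " ".join(value.split())
        return value

    # Series.map applies clean() to each cell, column by column.
    return df.apply(lambda col: col.map(clean))


# Small made-up table standing in for an exported docling table.
example = pd.DataFrame({"Item": ["Widget\nA", "Gadget"], "Qty": [1, 2]})
cleaned = collapse_cell_whitespace(example)
print(cleaned)
```

In the loop above, this would be applied as `df = collapse_cell_whitespace(df)` before calling `df.to_excel(...)`.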

Nov 05 '24 12:11 pankpy

@pankpy Please try Docling v2.26.0. We updated the table model with new weights, and it might produce better results.
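To confirm which version is installed before retrying, a small check along these lines may help. It is a sketch using only the standard library; `version_tuple` and `has_updated_table_model` are hypothetical helper names, and the `2.26.0` threshold is taken from the comment above.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text):
    """Turn a version string such as '2.26.0' into (2, 26, 0).

    Keeps the first three numeric components and pads with zeros.
    """
    nums = [int(n) for n in re.findall(r"\d+", text)[:3]]
    return tuple(nums + [0] * (3 - len(nums)))


def has_updated_table_model(installed):
    """True if the installed version is at least 2.26.0 (hypothetical helper)."""
    return version_tuple(installed) >= (2, 26, 0)


try:
    installed = version("docling")  # distribution name assumed to be "docling"
    if has_updated_table_model(installed):
        print(f"docling {installed} is at or above v2.26.0")
    else:
        print(f"docling {installed} is older than v2.26.0; consider upgrading")
except PackageNotFoundError:
    print("docling is not installed in this environment")
```

If the installed version is older, upgrading with `pip install --upgrade docling` and re-running the conversion on Sample.pdf would be the next step.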

Mar 17 '25 09:03 maxmnemonic