Complete text in rows

Open pankpy opened this issue 1 year ago • 2 comments

Thank you for the initiative. I am using it for table extraction, and it returns tables/dataframes as expected. However, some rows are missing part of their text, and in others the text is split across multiple lines. Is there any way to fix this?

pankpy · Nov 04 '24 17:11

@pankpy Could you please provide an example that illustrates the behaviour? Thanks.

cau-git · Nov 05 '24 09:11

Thank you. Please find the attached files.

from docling.datamodel.base_models import InputFormat
from docling.document_converter import (
    DocumentConverter,
    PdfFormatOption,
    WordFormatOption,
)
from docling.pipeline.simple_pipeline import SimplePipeline
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False  # Not using scanned documents
pipeline_options.do_table_structure = True

doc_converter = DocumentConverter(
    # All of the below is optional and has internal defaults.
    allowed_formats=[
        InputFormat.PDF,
        InputFormat.IMAGE,
        InputFormat.DOCX,
        InputFormat.HTML,
        InputFormat.PPTX,
    ],  # whitelist formats, non-matching files are ignored.
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,  # pipeline options go here.
            backend=PyPdfiumDocumentBackend,  # optional: pick an alternative backend
        ),
        InputFormat.DOCX: WordFormatOption(
            pipeline_cls=SimplePipeline  # default for office formats and HTML
        ),
    },
)

###############

# Raw string so the backslashes in the Windows path are not treated as escapes.
ConversionResult = doc_converter.convert(r"E:\zPankaj\Sample.pdf")  # previously convert_single
print(ConversionResult.document.export_to_markdown())

print('VERIFY RESULT', ConversionResult.document)
print('RESULT TYPE', type(ConversionResult.document))

# Export every detected table to a DataFrame and save it as an Excel file.
for i, table in enumerate(ConversionResult.document.tables):
    df = table.export_to_dataframe()
    print(df)
    df.to_excel(f'Output Sample_S df_{i}.xlsx')

Attachments: Sample.pdf, Output Sample_S df_0.xlsx, Output Sample_S df_1.xlsx, Pycharm_prints

pankpy · Nov 05 '24 12:11

@pankpy Please try Docling v2.26.0. We updated the table model with new weights, which might produce better results.

maxmnemonic · Mar 17 '25 09:03
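
For the "text in multiple lines" part of the original question, one possible workaround (not something proposed by the maintainers in this thread) is to collapse embedded line breaks in the exported DataFrame before saving it. The sketch below assumes the split text arrives as newline characters inside a single cell or header. `collapse_cell_whitespace` is a hypothetical helper, not part of Docling, and it cannot restore text that the table model dropped entirely.

```python
import pandas as pd


def collapse_cell_whitespace(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df where runs of whitespace (including newlines)
    in string cells and string column labels are collapsed to one space."""

    def _clean(value):
        # Only touch strings; numbers, None, NaN etc. pass through unchanged.
        if isinstance(value, str):
            return " ".join(value.split())
        return value

    # Apply column by column so duplicate column labels are handled too.
    cleaned = df.apply(lambda col: col.map(_clean))
    cleaned.columns = [_clean(label) for label in cleaned.columns]
    return cleaned


if __name__ == "__main__":
    # Stand-in for a DataFrame returned by table.export_to_dataframe().
    sample = pd.DataFrame({"Item\nname": ["Long text\nsplit over lines", "ok"]})
    print(collapse_cell_whitespace(sample))
```

In the loop from the earlier comment, this could be applied as `df = collapse_cell_whitespace(table.export_to_dataframe())` before calling `df.to_excel(...)`.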