Complete text in rows
Thank you for the initiative. I am using it for table extraction, and it returns tables/dataframes as expected. However, some rows come back with incomplete text, or with their text split across multiple lines. Is there any way to fix this?
@pankpy Could you please provide an example to illustrate the behaviour? Thanks.
Thank you. Please find the attached files.
from docling.datamodel.base_models import InputFormat
from docling.document_converter import (
    DocumentConverter,
    PdfFormatOption,
    WordFormatOption,
)
from docling.pipeline.simple_pipeline import SimplePipeline
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False  # Not using scanned documents
pipeline_options.do_table_structure = True

doc_converter = (
    DocumentConverter(  # all of the below is optional, has internal defaults.
        allowed_formats=[
            InputFormat.PDF,
            InputFormat.IMAGE,
            InputFormat.DOCX,
            InputFormat.HTML,
            InputFormat.PPTX,
        ],  # whitelist formats, non-matching files are ignored.
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,  # pipeline options go here.
                backend=PyPdfiumDocumentBackend,  # optional: pick an alternative backend
            ),
            InputFormat.DOCX: WordFormatOption(
                pipeline_cls=SimplePipeline  # default for office formats and HTML
            ),
        },
    )
)
###############
# Raw string so the backslashes in the Windows path are not treated as escapes.
conv_result = doc_converter.convert(r"E:\zPankaj\Sample.pdf")  # previously convert_single
print(conv_result.document.export_to_markdown())
print('VERIFY RESULT', conv_result.document)
print('RESULT TYPE', type(conv_result.document))

# Export each detected table to a DataFrame and save it as an Excel file.
for i, table in enumerate(conv_result.document.tables):
    df = table.export_to_dataframe()
    print(df)
    df.to_excel(f'Output Sample_S df_{i}.xlsx')
Sample.pdf
Output Sample_S df_0.xlsx
Output Sample_S df_1.xlsx
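If the "text in multiple lines" symptom turns out to be newline characters embedded inside individual cells (an assumption; it would not help if a logical row was split into several rows, or if text is actually missing), a small post-processing step may be enough as a workaround. The sketch below is plain Python and independent of Docling; `collapse_cell` and `clean_rows` are hypothetical helper names, not part of any library.

```python
import re

# Matches any run of whitespace, including newlines and tabs.
_WS_RUN = re.compile(r"\s+")


def collapse_cell(value):
    """Collapse whitespace runs in string cells; leave other values unchanged."""
    if isinstance(value, str):
        return _WS_RUN.sub(" ", value).strip()
    return value


def clean_rows(rows):
    """Apply collapse_cell to every cell of a list-of-lists table."""
    return [[collapse_cell(cell) for cell in row] for row in rows]


# Made-up sample rows, only to illustrate the behaviour.
rows = [
    ["Item", "Description"],
    ["A-1", "first line\nsecond line"],
    ["A-2", 42],
]
cleaned = clean_rows(rows)
print(cleaned)
```

With a pandas DataFrame, the same function could be applied per cell, for example `df.map(collapse_cell)` on pandas 2.1 or newer, or `df.applymap(collapse_cell)` on older versions.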
@pankpy Please try Docling v2.26.0. We updated the table model with new weights, and it might produce better results.
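Before re-running the conversion, it may help to confirm which Docling version is actually installed. The following is a minimal sketch using only the standard library; `parse_version`, `is_at_least`, and `installed_docling_version` are hypothetical helper names, and the comparison is deliberately simple (it ignores pre-release suffixes such as `rc1`).

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '2.26.0' into (2, 26, 0); stop at the first non-numeric component."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed, required="2.26.0"):
    """Return True if `installed` is the same as or newer than `required`."""
    a, b = parse_version(installed), parse_version(required)
    width = max(len(a), len(b))
    # Pad with zeros so that '2.26' and '2.26.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_docling_version():
    """Return the installed docling version string, or None if it is not installed."""
    try:
        return version("docling")
    except PackageNotFoundError:
        return None


current = installed_docling_version()
if current is None:
    print("docling is not installed in this environment")
elif is_at_least(current):
    print(f"docling {current} includes the updated table model weights")
else:
    print(f"docling {current} is older than 2.26.0; try: pip install --upgrade docling")
```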