Converter seems to get stuck on very large pdfs

Open Convl opened this issue 8 months ago • 3 comments

Bug

When trying to parse very large PDF files (such as this one, with over 3000 pages), the converter seems to get stuck. For instance, a 300-page PDF with similar content may take anywhere from 10 to 20 minutes on my system, but the 3000+ page file above did not finish converting even after 8 hours.

Steps to reproduce

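from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Enable OCR and table structure recognition with cell matching.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True

# Use the pypdfium2 backend for PDF input.
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend
        )
    }
)
converter.convert("https://dserver.bundestag.de/brd/2024/0350-24.pdf")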

Docling version

Docling version: 2.28.0
Docling Core version: 2.23.3
Docling IBM Models version: 3.4.1
Docling Parse version: 4.0.0
Python: cpython-312 (3.12.6)
Platform: Windows-11-10.0.26100-SP0

Python version

Python 3.12.6

Convl avatar Apr 02 '25 08:04 Convl

Facing the same issue on Mac as well, despite specifying the accelerator options.

glanzz avatar May 05 '25 03:05 glanzz

@glanzz @Convl was either of you able to resolve this or find a workaround?

zeerak-wyne-sportsbet avatar May 23 '25 06:05 zeerak-wyne-sportsbet

@zeerak-wyne-sportsbet I wrote a simple script to split my large PDF into smaller PDFs with fewer pages, and then converted those.

glanzz avatar May 23 '25 17:05 glanzz
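
For readers looking for a starting point, the split-then-convert workaround mentioned above could be sketched roughly as follows. This is not the script glanzz used, which was not posted. It assumes the third-party pypdf package is installed and that the PDF has already been downloaded to a local file; the helper names `chunk_ranges` and `split_pdf` and the default chunk size of 200 pages are made up for illustration.

```python
from pathlib import Path


def chunk_ranges(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return zero-based, half-open (start, stop) page ranges covering the file."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(src: Path, out_dir: Path, chunk_size: int = 200) -> list[Path]:
    """Write src as several smaller PDFs and return their paths in page order."""
    # Assumption: pypdf is available (pip install pypdf). It is imported here
    # so that chunk_ranges can be used without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(str(src))
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    for start, stop in chunk_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        # File names carry one-based page numbers, e.g. report_00001-00200.pdf
        part = out_dir / f"{src.stem}_{start + 1:05d}-{stop:05d}.pdf"
        with part.open("wb") as handle:
            writer.write(handle)
        parts.append(part)
    return parts
```

Each path returned by `split_pdf` can then be passed to `converter.convert(...)` in turn. Whether this avoids the stall on a given machine is untested here; a smaller chunk size may be worth trying if a single chunk still runs for a long time.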