Docling converter seems to get stuck on very large PDFs

Bug

When trying to parse very large PDF files (such as the one used in the steps below, with over 3000 pages), the converter seems to get stuck. For instance, a 300-page PDF with similar content may take anywhere from 10 to 20 minutes on my system, but the 3000+ page file did not finish converting even after 8 hours.

Steps to reproduce

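from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Enable OCR and table structure recognition with cell matching.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True

# Use the pypdfium2 backend for PDF input.
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend
        )
    }
)
# The conversion of this 3000+ page document does not finish.
converter.convert("https://dserver.bundestag.de/brd/2024/0350-24.pdf")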

Docling version

Docling version: 2.28.0
Docling Core version: 2.23.3
Docling IBM Models version: 3.4.1
Docling Parse version: 4.0.0
Python: cpython-312 (3.12.6)
Platform: Windows-11-10.0.26100-SP0

Python version

Python 3.12.6

Apr 02 '25 08:04 Convl

Facing the same issue on Mac as well, despite specifying the accelerator options.

May 05 '25 03:05 glanzz

@glanzz @Convl were either of you able to resolve this or find a workaround?

May 23 '25 06:05 zeerak-wyne-sportsbet

@zeerak-wyne-sportsbet I wrote a simple script to split my large PDF into smaller PDFs with fewer pages and then converted those.

May 23 '25 17:05 glanzz
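
For anyone looking for a starting point, here is a minimal sketch of the splitting workaround described above. It is not glanzz's actual script, which was not posted. It assumes the third-party pypdf package is installed and that the PDF has already been downloaded to a local path; the names chunk_ranges and split_pdf, the chunk size of 100 pages, and the output file naming are all hypothetical choices.

```python
from pathlib import Path


def chunk_ranges(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return 0-based, half-open (start, stop) page ranges covering all pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(src: Path, out_dir: Path, chunk_size: int = 100) -> list[Path]:
    """Write consecutive chunks of `src` as separate PDFs and return their paths."""
    # Assumption: pypdf is installed (pip install pypdf). It is imported here
    # so that chunk_ranges above has no third-party dependency.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(str(src))
    out_dir.mkdir(parents=True, exist_ok=True)

    parts: list[Path] = []
    for start, stop in chunk_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        # Hypothetical naming scheme: 1-based first and last page of the chunk.
        part = out_dir / f"{src.stem}_{start + 1:05d}-{stop:05d}.pdf"
        with part.open("wb") as handle:
            writer.write(handle)
        parts.append(part)
    return parts
```

Each returned path can then be passed to converter.convert(...) in turn, reusing the converter from the issue description. Note that tables or paragraphs spanning a chunk boundary will be cut at that boundary, so the per-chunk results may need some cleanup when they are merged.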