Converter seems to get stuck on very large PDFs
Bug
When trying to parse very large PDF files (such as the one used in the steps below, with over 3000 pages), the converter seems to get stuck. For instance, a 300-page PDF with similar content takes anywhere from 10 to 20 minutes on my system, but the 3000+ page file did not finish converting even after 8 hours.
Steps to reproduce
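from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Enable OCR and table structure recognition with cell matching.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True

# Use the pypdfium2 backend for PDF input.
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend
        )
    }
)

# The 3000+ page document that never finishes converting.
converter.convert("https://dserver.bundestag.de/brd/2024/0350-24.pdf")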
Docling version
Docling version: 2.28.0
Docling Core version: 2.23.3
Docling IBM Models version: 3.4.1
Docling Parse version: 4.0.0
Python: cpython-312 (3.12.6)
Platform: Windows-11-10.0.26100-SP0
Python version
Python 3.12.6
Facing the same issue on Mac as well, despite specifying the accelerator options.
@glanzz @Convl were either of you able to resolve this or find a workaround?
@zeerak-wyne-sportsbet I wrote a simple script to split my large PDF into smaller PDFs with fewer pages and then converted those.
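For anyone who wants to try the same workaround, here is a minimal sketch of what such a splitting script could look like. It is not the script mentioned above. It assumes the PDF has already been downloaded to disk and that pypdf is installed. The names chunk_ranges and split_pdf and the default chunk size of 300 pages are made up for illustration.

```python
from pathlib import Path


def chunk_ranges(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) page index ranges covering total_pages."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(src: str, out_dir: str, chunk_size: int = 300) -> list[Path]:
    """Write src as several smaller PDFs and return their paths in page order."""
    # Assumption: pypdf is installed (pip install pypdf). It is imported here
    # so that chunk_ranges stays usable without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    parts = []
    for start, end in chunk_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, end):
            writer.add_page(reader.pages[index])
        # File names carry 1-based page numbers, e.g. report_00001-00300.pdf
        part = out / f"{Path(src).stem}_{start + 1:05d}-{end:05d}.pdf"
        with open(part, "wb") as handle:
            writer.write(handle)
        parts.append(part)
    return parts
```

Each returned path can then be passed to converter.convert() one at a time, so a part that stalls can be identified and retried on its own. The results still have to be stitched together afterwards, and tables or paragraphs that span a split boundary may end up cut in two.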