
Docling getting killed when i feed bigger pdf files which have 900+ pages

Open Greatz08 opened this issue 7 months ago • 6 comments

It doesn't use the full 8 GB of VRAM on my RTX 4060 (it only uses 3.8 GB of VRAM). After some time it starts consuming a lot of my system RAM, my CPU usage rises sharply, and the process gets killed once RAM consumption reaches 95%.

So why is this happening, and is there any way to fix it, so that the docling process uses up to 7 GB of VRAM and starts creating markdown as it goes, instead of consuming every possible resource and ultimately dying? :-((

Greatz08 avatar May 24 '25 13:05 Greatz08

Not an actual solution, but a suggested workaround. Most likely this is just the OS killing the process due to an out-of-memory (OOM) condition. You could split the PDF after a maximum of e.g. 500 pages and join the results at the end. You could automate this with libraries like pypdf or pymupdf as a first step before parsing with docling; a sketch of the joining step follows the example below.

An example with pypdf

import os

from pypdf import PdfReader, PdfWriter

def split_pdf_by_page_limit(input_pdf_path, output_folder, max_pages_per_file=500):
    # Read the source PDF and work out how many output files are needed
    reader = PdfReader(input_pdf_path)
    total_pages = len(reader.pages)
    num_files = (total_pages + max_pages_per_file - 1) // max_pages_per_file

    for i in range(num_files):
        writer = PdfWriter()
        start_page = i * max_pages_per_file
        end_page = min((i + 1) * max_pages_per_file, total_pages)

        # Copy this chunk's pages into a fresh writer
        for page_num in range(start_page, end_page):
            writer.add_page(reader.pages[page_num])

        # Write the chunk out as split_part_1.pdf, split_part_2.pdf, ...
        output_filename = f"split_part_{i + 1}.pdf"
        output_path = os.path.join(output_folder, output_filename)

        with open(output_path, "wb") as output_file:
            writer.write(output_file)
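
To finish the workaround, each split part can then be converted separately and the markdown joined at the end. A minimal sketch, assuming docling's standard DocumentConverter API and the split_part_*.pdf files produced by the helper above (the function name and output handling are illustrative, not a prescribed method):

from pathlib import Path

from docling.document_converter import DocumentConverter

def convert_parts_to_markdown(parts_folder, output_md_path):
    converter = DocumentConverter()
    markdown_chunks = []

    # Sort split_part_1.pdf, split_part_2.pdf, ... by their numeric suffix
    parts = sorted(
        Path(parts_folder).glob("split_part_*.pdf"),
        key=lambda p: int(p.stem.split("_")[-1]),
    )

    # Convert each part on its own and collect the markdown
    for part in parts:
        result = converter.convert(str(part))
        markdown_chunks.append(result.document.export_to_markdown())

    # Join the per-part markdown into a single output file
    Path(output_md_path).write_text("\n\n".join(markdown_chunks), encoding="utf-8")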

AI091 avatar May 25 '25 16:05 AI091

@Greatz08 For large PDFs, it might be good (depending on your machine) to convert using selected page ranges (1-100, 101-200, etc.). We have specific parameters for this in the pipeline.

PeterStaar-IBM avatar May 26 '25 05:05 PeterStaar-IBM

> @Greatz08 For large PDFs, it might be good (depending on your machine) to convert using selected page ranges (1-100, 101-200, etc.). We have specific parameters for this in the pipeline.

Can you please let me know which parameters those are?

harinisri2001 avatar May 30 '25 07:05 harinisri2001

> @Greatz08 For large PDF files, depending on your machine, it may be better to convert using specified page ranges (1-100, 101-200, etc.). We have specific parameters for this in the pipeline.

It would be best if you could give a concrete example.

tzyyy avatar May 30 '25 08:05 tzyyy

@harinisri2001 conv_res = doc_converter.convert(source=file_path, page_range=[1, 100])
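
Building on that, here is a rough sketch of converting a large file in page-range slices and concatenating the markdown. It assumes page_range takes an inclusive, 1-based start/end pair as in the line above; the chunk size and helper name are illustrative:

from docling.document_converter import DocumentConverter
from pypdf import PdfReader

def convert_in_chunks(file_path, chunk_size=100):
    converter = DocumentConverter()
    # pypdf is only used here to count pages before conversion
    total_pages = len(PdfReader(file_path).pages)
    markdown_parts = []

    # Convert one page range at a time to keep memory use bounded
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        conv_res = converter.convert(source=file_path, page_range=(start, end))
        markdown_parts.append(conv_res.document.export_to_markdown())

    return "\n\n".join(markdown_parts)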

ColeDrain avatar May 30 '25 12:05 ColeDrain

@ColeDrain Thank you! 🤗

PeterStaar-IBM avatar May 30 '25 13:05 PeterStaar-IBM

@AI091 @PeterStaar-IBM thanks, will try soon

Greatz08 avatar Jun 04 '25 03:06 Greatz08