Docling gets killed when I feed it bigger PDF files with 900+ pages
It doesn't use the full 8 GB of VRAM on my RTX 4060 (only 3.8 GB). After some time it starts consuming a lot of system RAM, CPU usage rises sharply, and the process gets killed once RAM consumption reaches 95%.
Why is this happening? Is there any way to fix it, so that the Docling process uses only 7 GB of VRAM and starts writing Markdown as it goes, instead of consuming every available resource and ultimately dying? :-((
Not an actual solution, but a suggested workaround. Most likely this is just the OS killing the process because it ran out of memory (OOM). You could split the PDF into files of at most, e.g., 500 pages each and join the results at the end. You could automate this with libraries like pypdf or pymupdf as a first step before parsing with Docling.
An example with pypdf:
from pypdf import PdfReader, PdfWriter


def split_pdf_by_page_limit(input_pdf_path, output_folder, max_pages_per_file=500):
    reader = PdfReader(input_pdf_path)
    total_pages = len(reader.pages)
    # Ceiling division: number of output files needed
    num_files = (total_pages + max_pages_per_file - 1) // max_pages_per_file
    for i in range(num_files):
        writer = PdfWriter()
        start_page = i * max_pages_per_file
        end_page = min((i + 1) * max_pages_per_file, total_pages)
        # Copy pages [start_page, end_page) into this part
        for page_num in range(start_page, end_page):
            writer.add_page(reader.pages[page_num])
        output_filename = f"split_part_{i + 1}.pdf"
        output_path = f"{output_folder}/{output_filename}"
        with open(output_path, "wb") as output_file:
            writer.write(output_file)
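For the "join the results at the end" step, here is a minimal sketch. It assumes each chunk was already converted to a Markdown file named split_part_1.md, split_part_2.md, and so on (a hypothetical naming scheme that mirrors the PDF names above), and it uses only the standard library. The function names are illustrative, not part of Docling or pypdf.

```python
import re
from pathlib import Path


def _part_number(path: Path) -> int:
    """Extract N from 'split_part_N.md' so that part 10 sorts after part 2."""
    match = re.search(r"split_part_(\d+)\.md$", path.name)
    if match is None:
        raise ValueError(f"unexpected file name: {path.name}")
    return int(match.group(1))


def join_markdown_parts(parts_folder, output_path):
    """Concatenate split_part_N.md files in numeric order into one file."""
    parts = sorted(Path(parts_folder).glob("split_part_*.md"), key=_part_number)
    # Strip trailing newlines so parts are separated by exactly one blank line
    texts = [p.read_text(encoding="utf-8").rstrip("\n") for p in parts]
    Path(output_path).write_text("\n\n".join(texts) + "\n", encoding="utf-8")
    return [p.name for p in parts]
```

Sorting by the numeric suffix matters here: a plain alphabetical sort would put split_part_10.md before split_part_2.md and scramble the page order.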
@Greatz08 For large PDFs, it might be good (depending on your machine) to convert using selected page ranges (1-100, 101-200, etc.). We have specific parameters for this in the pipeline.
Can you please let me know which parameters those are?
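@Greatz08 For large PDFs, depending on your machine, it may be better to convert using specified page ranges (1-100, 101-200, etc.). We have specific parameters for this in the pipeline.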
It would be best if you could give an example.
@harinisri2001
conv_res = doc_converter.convert(source=file_path, page_range=[1, 100])
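Building on that call, here is a rough sketch of converting a whole document one page range at a time. It assumes page_range is 1-based and inclusive, as the 1-100, 101-200 example above suggests, and that the caller already knows total_pages (for instance from len(PdfReader(path).pages) with pypdf). The helper names page_ranges and convert_in_chunks are hypothetical, not part of Docling, and whether this is enough to avoid the OOM kill will depend on your machine.

```python
def page_ranges(total_pages, chunk_size=100):
    """Yield 1-based, inclusive (start, end) ranges covering total_pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(1, total_pages + 1, chunk_size):
        yield (start, min(start + chunk_size - 1, total_pages))


def convert_in_chunks(file_path, total_pages, chunk_size=100):
    """Convert one page range at a time and return the joined Markdown."""
    # Imported here so page_ranges() can be used without docling installed.
    from docling.document_converter import DocumentConverter

    doc_converter = DocumentConverter()
    parts = []
    for page_range in page_ranges(total_pages, chunk_size):
        # Same call as above, with the range supplied by the loop
        conv_res = doc_converter.convert(source=file_path, page_range=page_range)
        parts.append(conv_res.document.export_to_markdown())
    return "\n\n".join(parts)
```

If memory is still tight, a smaller chunk_size or writing each part to disk inside the loop instead of collecting them in a list may help.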
@ColeDrain Thank you! 🤗
@AI091 @PeterStaar-IBM thanks, will try soon