
Local scrape_file failing for some PDFs with out-of-memory errors

Open camrail opened this issue 6 months ago • 1 comment

Hi there,

I've just switched over to the local version from the API, and I'm experiencing a memory issue that kills my Celery worker or shell environment when I run scrape_file. On some PDFs the worker exits with "OOMKilled": true, even though I should have plenty of resources allocated (a ~22 GB limit).

It reliably fails on some PDFs and reliably succeeds on others.

Here is my setup code

from openai import OpenAI
from thepipe.scraper import scrape_file

client = OpenAI()
results = scrape_file(
    filepath=local_file_path, openai_client=client, model="gpt-4o"
)

thepipe-api==1.5.8 litellm==1.61.1

Thank you 🙏

camrail avatar May 30 '25 10:05 camrail

Hi @camrail,

I've introduced some additional options (rescale: float, include_input_images: bool, and include_output_images: bool) into the scraper.scrape_pdf function to ease memory usage; this creates a tradeoff that may reduce scraping accuracy.

rescale: Factor by which to rescale individual page images from PDFs. Defaults to 1.0.
include_input_images: Whether to pass PDF page images to the vision-language model for enhanced and well-formatted markdown results. Defaults to True.
include_output_images: Whether to return PDF page images in the chunk response. Useful for downstream use with vision-language models after scraping is complete. Defaults to True.
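If you want the most aggressive memory savings, all three options can be combined. Here's a rough sketch (the file path is a placeholder, and disabling input images carries the accuracy tradeoff described above, since the model no longer sees the page images):

from openai import OpenAI
from thepipe.scraper import scrape_pdf

client = OpenAI()

# Most memory-conservative combination: downscale page images and skip both
# passing them to the vision-language model and returning them in the chunks.
# Markdown quality may drop because the model only sees extracted text.
results = scrape_pdf(
    filepath="example.pdf",        # hypothetical path for illustration
    openai_client=client,
    model="gpt-4o",
    rescale=0.5,
    include_input_images=False,
    include_output_images=False,
)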

Memory usage inflates with longer / higher-resolution PDFs, since individual page images are extracted and kept in memory. For example, a 300 MB PDF file with 400 pages can inflate well beyond 1 GB while being processed with the scraper.
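As a rough back-of-envelope sketch (assuming pages are rendered as uncompressed RGB bitmaps and that rescale shrinks both page dimensions, neither of which is guaranteed by the library), you can see why long PDFs inflate and why rescale helps roughly quadratically:

# Rough estimate of the in-memory footprint of rendered page images.
# Assumes 3 bytes per pixel (RGB) and that rescale scales both dimensions.
def estimated_page_image_memory_mb(pages, width_px, height_px, rescale=1.0):
    bytes_per_page = (width_px * rescale) * (height_px * rescale) * 3
    return pages * bytes_per_page / 1024 ** 2

# A hypothetical 400-page PDF rendered at ~1700x2200 px per page:
print(estimated_page_image_memory_mb(400, 1700, 2200))       # ~4280 MB at rescale=1.0
print(estimated_page_image_memory_mb(400, 1700, 2200, 0.5))  # ~1070 MB at rescale=0.5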

Below are some memory usage benchmarks for an example PDF file:

[memory usage benchmark chart]

Furthermore, your memory usage will roughly double if you are running the LLM server on the same machine as thepipe, since the server unpacks and processes the data independently. The LLM server itself can also consume tens of GB for LLM inference.
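If that doubling is the problem, one option is to point the OpenAI-compatible client at an inference server running on a different machine, so the model's memory use doesn't stack on top of thepipe's. A minimal sketch; the URL, API key, and model name are placeholders for your own deployment:

from openai import OpenAI
from thepipe.scraper import scrape_pdf

# Placeholder endpoint for a remote OpenAI-compatible server (e.g. vLLM).
client = OpenAI(base_url="http://remote-llm-host:8000/v1", api_key="EMPTY")

results = scrape_pdf(
    filepath="example.pdf",  # hypothetical path
    openai_client=client,
    model="your-served-model-name",
)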

TLDR: Try doing

results = scrape_pdf(
    filepath=local_file_path,
    openai_client=client,
    model="gpt-4o",
    rescale=0.5,
    include_output_images=False,
)

after upgrading thepipe-api. This should greatly reduce memory consumption. If you still hit memory limits, please let me know!

emcf avatar Jun 02 '25 21:06 emcf