CUDA OOM with large PDFs

🐛 Describe the bug

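Component: Pipeline (pipeline.py)

Version: latest as of February 27, 2025 (assumed from git clone; the package list below shows olmocr at commit bd08fdb4761538c96224ace9e951e5d956589790)

Environment:

OS: Ubuntu

Python: 3.12

CUDA: 12.4

SGLang: 0.4.2

PyTorch: installed via olmocr dependencies (torch==2.5.1 in the package list below)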

Description: When processing large PDF files (e.g., tests/2.pdf with 4180 pages), the olmocr.pipeline module allocates excessive CUDA memory (VRAM) during the preparation phase, leading to a torch.OutOfMemoryError. This happens even when parameters such as --pages_per_group and --workers are set to limit the number of pages processed simultaneously. The issue appears to stem from the pipeline loading or rendering all pages into CUDA memory at once, rather than respecting the batch size defined by --pages_per_group.

Steps to Reproduce:

Set up the olmocr environment:

git clone https://github.com/allenai/olmocr.git
cd olmocr
pip install -e .
pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/

Start the SGLang server:

python -m sglang.launch_server --model-path allenai/olmOCR-7B-0225-preview --port 30024

Run the pipeline with a large PDF (e.g., 4180 pages):

python -m olmocr.pipeline ./localworkspace --pdfs tests/2.pdf --model allenai/olmOCR-7B-0225-preview --pages_per_group 10 --workers 1

Monitor VRAM usage:

watch -n 1 nvidia-smi
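
`watch nvidia-smi` shows one snapshot at a time. To track the pipeline process's usage and its peak separately from the SGLang server, a small poller can help. The following is a minimal sketch, not part of olmocr: it assumes nvidia-smi's `--query-compute-apps=pid,used_memory --format=csv,noheader,nounits` query (which reports MiB per process), and the script and function names are hypothetical.

```python
import subprocess
import sys
import time


def parse_compute_apps(csv_text: str) -> dict[int, int]:
    """Parse nvidia-smi compute-apps CSV (pid, used_memory in MiB) into {pid: MiB}.

    A PID that appears on several GPUs has its usage summed. Rows whose memory
    is not numeric (e.g. "[N/A]") are skipped.
    """
    usage: dict[int, int] = {}
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue
        pid_str, mem_str = parts
        if not (pid_str.isdigit() and mem_str.isdigit()):
            continue
        pid = int(pid_str)
        usage[pid] = usage.get(pid, 0) + int(mem_str)
    return usage


def query_compute_apps() -> dict[int, int]:
    """Run nvidia-smi once and return per-process GPU memory in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_compute_apps(out)


def watch(pid: int, interval_s: float = 1.0) -> None:
    """Print one process's GPU memory and the peak seen so far, until Ctrl+C."""
    peak = 0
    try:
        while True:
            used = query_compute_apps().get(pid, 0)
            peak = max(peak, used)
            print(f"pid {pid}: {used} MiB (peak {peak} MiB)", flush=True)
            time.sleep(interval_s)
    except KeyboardInterrupt:
        pass


if __name__ == "__main__" and len(sys.argv) > 1:
    watch(int(sys.argv[1]))
```

For example, `python vram_watch.py <pipeline-pid>` (file name hypothetical) prints the pipeline's usage once per second, which should show whether the growth happens before any requests reach the server.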

Actual Result:

The pipeline process allocates excessive CUDA memory (~3 GiB observed, growing with PDF size) before sending any data to the server. The log line "Got 4180 pages to do for tests/2.pdf in worker 0" indicates that all pages are prepared at once, ignoring --pages_per_group 10. VRAM then fills up (20.45 GiB held by the server plus 3.18 GiB by the pipeline, out of 23.68 GiB total), triggering:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 130.00 MiB. GPU 0 has a total capacity of 23.68 GiB of which 111.12 MiB is free.

Expected Result: The pipeline should respect --pages_per_group 10, preparing and processing only 10 pages at a time in CUDA memory. Pipeline VRAM usage should stay minimal (e.g., under 1-2 GiB for 10 pages), allowing large PDFs (1000+ pages) to be processed without OOM errors.
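
The expected behavior can be sketched as a loop that schedules at most `pages_per_group` pages at a time and waits for each group to finish before starting the next. This is a hypothetical illustration, not olmocr's actual code: `process_in_groups` and the `process_page` callback (standing in for "render one page and query the server") are assumed names, and the memory bound only holds if per-page resources are allocated inside `process_page` and released when it returns.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def process_in_groups(
    num_pages: int,
    pages_per_group: int,
    process_page: Callable[[int], Awaitable[str]],
) -> list[str]:
    """Process pages 1..num_pages with at most `pages_per_group` in flight.

    Each group is awaited to completion before the next one starts, so at most
    one group's worth of per-page work is alive at any moment.
    """
    if pages_per_group < 1:
        raise ValueError("pages_per_group must be >= 1")
    results: list[str] = []
    for start in range(1, num_pages + 1, pages_per_group):
        group = range(start, min(start + pages_per_group, num_pages + 1))
        # gather() returns results in argument order, so pages stay in order.
        results.extend(await asyncio.gather(*(process_page(p) for p in group)))
    return results


async def _demo_page(page_num: int) -> str:
    # Hypothetical stand-in for rendering a page and sending it to the server.
    await asyncio.sleep(0.01)
    return f"text of page {page_num}"


if __name__ == "__main__":
    pages = asyncio.run(process_in_groups(25, 10, _demo_page))
    print(len(pages), pages[0], pages[-1])
```

A sliding window built on `asyncio.Semaphore(pages_per_group)` is an alternative: it keeps up to N pages in flight continuously instead of waiting for the slowest page in each group, while bounding the number of pages held at once in the same way.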

Additional Information: Pipeline VRAM usage spikes during PDF preparation, before server inference begins (the server logs show no significant activity beyond the initial requests). Setting CUDA_VISIBLE_DEVICES="" prevents CUDA usage, but the pipeline then crashes with RuntimeError: No CUDA GPUs are available, indicating a hard dependency on CUDA.

Suggested Fix: Modify pipeline.py to load and render PDF pages incrementally, in batches defined by --pages_per_group, instead of loading all pages into CUDA memory at once. Optionally, allow CPU-only rendering as a fallback by making the GPU check in check.py:38 optional (or removing it).
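
Making the GPU check optional could look like the following minimal sketch. It is not olmocr's check.py: `check_gpu`, `require_gpu`, and the injectable `is_available` parameter are hypothetical, and it assumes the pipeline process itself only needs the GPU for optional work, since inference runs in the SGLang server.

```python
from __future__ import annotations

import logging
from collections.abc import Callable

logger = logging.getLogger(__name__)


def check_gpu(require_gpu: bool, is_available: Callable[[], bool] | None = None) -> bool:
    """Return True if a CUDA GPU is usable, False if the caller should fall back to CPU.

    Raises RuntimeError only when `require_gpu` is True and no GPU is found.
    `is_available` defaults to torch.cuda.is_available and can be overridden in tests.
    """
    if is_available is None:
        import torch  # imported lazily so the function can be exercised without torch

        is_available = torch.cuda.is_available
    if is_available():
        return True
    if require_gpu:
        raise RuntimeError("No CUDA GPUs are available")
    logger.warning("No CUDA GPUs are available; continuing in CPU-only mode")
    return False
```

A caller could pass something like `require_gpu=not args.cpu_only`, where `--cpu_only` is a hypothetical flag. With `CUDA_VISIBLE_DEVICES=""`, `torch.cuda.is_available()` returns False, so the pipeline would log a warning instead of crashing.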

Logs:

2025-02-27 01:41:39,217 - __main__ - INFO - Got 4180 pages to do for tests/2.pdf in worker 0
[...]

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 130.00 MiB. GPU 0 has a total capacity of 23.68 GiB of which 111.12 MiB is free. Process 34417 has 20.45 GiB memory in use. Including non-PyTorch memory, this process has 3.18 GiB memory in use.


Installed packages (pip freeze):

aiohappyeyeballs==2.4.6
aiohttp==3.11.13
aiosignal==1.3.2
annotated-types==0.7.0
anthropic==0.47.2
anyio==4.8.0
asttokens==3.0.0
attrs==25.1.0
beaker-py==1.34.1
bitsandbytes==0.45.3
bleach==6.2.0
boto3==1.37.2
botocore==1.37.2
cached_path==1.6.7
cachetools==5.5.2
certifi==2025.1.31
cffi==1.17.1
charset-normalizer==3.4.1
click==8.1.8
cloudpickle==3.1.1
compressed-tensors==0.8.0
cryptography==44.0.1
cuda-bindings==12.8.0
cuda-python==12.8.0
datasets==3.3.2
decorator==5.2.1
decord==0.6.0
dill==0.3.8
diskcache==5.6.3
distro==1.9.0
docker==7.1.0
einops==0.8.1
executing==2.2.0
fastapi==0.115.8
filelock==3.17.0
flashinfer==0.1.6+cu124torch2.4
frozenlist==1.5.0
fsspec==2024.12.0
ftfy==6.3.1
fuzzysearch==0.7.3
gguf==0.10.0
google-api-core==2.24.1
google-auth==2.38.0
google-cloud-core==2.4.2
google-cloud-storage==2.19.0
google-crc32c==1.6.0
google-resumable-media==2.7.2
googleapis-common-protos==1.68.0
h11==0.14.0
hf_transfer==0.1.9
httpcore==1.0.7
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.27.1
idna==3.10
importlib_metadata==8.6.1
interegular==0.3.3
ipython==8.32.0
jedi==0.19.2
Jinja2==3.1.5
jiter==0.8.2
jmespath==1.0.1
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
lark==1.2.2
lingua-language-detector==2.0.2
litellm==1.61.17
llvmlite==0.44.0
lm-format-enforcer==0.10.11
markdown-it-py==3.0.0
markdown2==2.5.3
MarkupSafe==3.0.2
matplotlib-inline==0.1.7
mdurl==0.1.2
mistral_common==1.5.3
modelscope==1.23.1
mpmath==1.3.0
msgpack==1.1.0
msgspec==0.19.0
multidict==6.1.0
multiprocess==0.70.16
nest-asyncio==1.6.0
networkx==3.4.2
numba==0.61.0
numpy==1.26.4
nvidia-cublas-cu12==
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==
nvidia-cufft-cu12==
nvidia-curand-cu12==
nvidia-cusolver-cu12==
nvidia-cusparse-cu12==
nvidia-cusparselt-cu12==0.6.2
nvidia-ml-py==12.570.86
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
-e git+https://github.com/allenai/olmocr.git@bd08fdb4761538c96224ace9e951e5d956589790#egg=olmocr
openai==1.64.0
opencv-python-headless==
orjson==3.10.15
outlines==0.0.46
packaging==24.2
pandas==2.2.3
parso==0.8.4
partial-json-parser==
pexpect==4.9.0
pillow==11.1.0
prometheus-fastapi-instrumentator==7.0.2
prometheus_client==0.21.1
prompt_toolkit==3.0.50
propcache==0.3.0
proto-plus==1.26.0
protobuf==5.29.3
psutil==7.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
py-cpuinfo==9.0.0
pyairports==2.1.1
pyarrow==19.0.1
pyasn1==0.6.1
pyasn1_modules==0.4.1
pycountry==24.6.1
pycparser==2.22
pydantic==2.10.6
pydantic_core==2.27.2
Pygments==2.19.1
pypdf==5.3.0
pypdfium2==4.30.1
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.20
pytz==2025.1
PyYAML==6.0.2
pyzmq==26.2.1
RapidFuzz==3.12.1
ray==2.42.1
referencing==0.36.2
regex==2024.11.6
requests==2.32.3
rich==13.9.4
rpds-py==0.23.1
rsa==4.9
s3transfer==0.11.3
safetensors==0.5.3
sentencepiece==0.2.0
sequence_align==0.2.0
setproctitle==1.3.5
setuptools==75.8.2
sgl-kernel==0.0.3.post1
sglang==0.4.2
six==1.17.0
smart-open==7.1.0
sniffio==1.3.1
stack-data==0.6.3
starlette==0.45.3
sympy==1.13.1
tiktoken==0.9.0
tokenizers==0.21.0
torch==2.5.1
torchao==0.8.0
torchvision==0.20.1
tqdm==4.67.1
traitlets==5.14.3
transformers==4.49.0
triton==3.1.0
typing_extensions==4.12.2
tzdata==2025.1
urllib3==2.3.0
uvicorn==0.34.0
uvloop==0.21.0
vllm==0.6.4.post1
watchfiles==1.0.4
wcwidth==0.2.13
webencodings==0.5.1
websockets==15.0
wrapt==1.17.2
xformers==0.0.28.post3
xgrammar==0.1.14
xxhash==3.5.0
yarl==1.18.3
zipp==3.21.0
zstandard==0.23.0

