VLM pipeline hangs indefinitely during document processing with Transformers backend on NVIDIA GPU
Bug
The VLM pipeline with Granite-Docling hangs indefinitely during document processing on NVIDIA RTX 3090 GPU. The process gets stuck at the "Processing document..." stage and never completes, even after 30+ minutes. This occurs with both multi-page and single-page PDFs.
The standard pipeline works perfectly fine on the same documents, completing in ~126 seconds for a 10-page PDF.
Symptoms:
- Process hangs after the log message "INFO - Processing document [filename].pdf"; no further progress, no errors, no completion
- GPU shows activity (model is loaded on CUDA) but generation appears stuck
- Reproducible 100% of the time with VLM pipeline
- Standard pipeline works without issues
Last log output before hanging:
2025-10-15 17:05:42,177 - INFO - Going to convert document batch...
2025-10-15 17:05:42,178 - INFO - Initializing pipeline for VlmPipeline with options hash 14b35a24912cc09d5c7735b8ff9d88c1
2025-10-15 17:05:42,351 - INFO - Loading plugin 'docling_defaults'
2025-10-15 17:05:42,354 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-10-15 17:05:42,564 - INFO - Accelerator device: 'cuda:0'
2025-10-15 17:06:25,701 - INFO - Processing document 3HANS-1.pdf
[HANGS HERE - no further output]
Steps to reproduce
Test 1: Using CLI (simplest reproduction)
# This hangs indefinitely
docling test_document.pdf --pipeline vlm --to md --output ./output_vlm --device cuda -v
# This works fine (completes in ~126s for 10-page PDF)
docling test_document.pdf --to md --output ./output_standard --device cuda -v
Test 2: Using Python API
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline
converter = DocumentConverter(
format_options={
InputFormat.PDF: PdfFormatOption(
pipeline_cls=VlmPipeline,
),
}
)
# Hangs at convert() call
result = converter.convert(source="test_document.pdf")
Test 3: Even single-page PDFs hang
# Created 1-page test PDF - still hangs
docling single_page.pdf --pipeline vlm --to md --output ./test --device cuda -v
Test 4: Attempted workaround with reduced tokens
from docling.datamodel import vlm_model_specs
import copy
# Modified max_new_tokens from default 8192 to 1024
custom_spec = copy.deepcopy(vlm_model_specs.GRANITEDOCLING_TRANSFORMERS)
custom_spec.max_new_tokens = 1024
# Still hangs
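For reference, the reduced-token spec was wired into the converter along these lines (a sketch; it assumes VlmPipelineOptions accepts the custom spec through its vlm_options field, as in the Docling VLM pipeline examples):
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline
# custom_spec is the reduced-token copy of GRANITEDOCLING_TRANSFORMERS created above
pipeline_options = VlmPipelineOptions(vlm_options=custom_spec)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)
result = converter.convert(source="test_document.pdf")  # still hangs at this call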
Environment Details
Docling version:
Docling version: 2.56.1
Docling Core version: 2.48.4
Docling IBM Models version: 3.9.1
Docling Parse version: 4.5.0
Python: cpython-312 (3.12.9)
Platform: Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python version:
Python 3.12.9
System Information:
- OS: Linux (WSL2)
- Kernel: 6.6.87.2-microsoft-standard-WSL2
- GPU: NVIDIA GeForce RTX 3090 (24GB VRAM)
- CUDA Version: 12.8
- PyTorch: 2.8.0+cu128
- CUDA Available: True
Verification that GPU is working:
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.get_device_name(0)
'NVIDIA GeForce RTX 3090'
>>> torch.__version__
'2.8.0+cu128'
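A slightly stronger check than device visibility is to launch a small kernel and synchronize on it, confirming that CUDA work actually runs to completion under WSL2 (not part of the original verification, just plain PyTorch):
>>> a = torch.randn(2048, 2048, device="cuda")
>>> (a @ a).sum().item()       # forces a matmul kernel and copies the result back to the host
>>> torch.cuda.synchronize()   # returns only once all queued CUDA work has finished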
Relevant package versions:
torch 2.8.0+cu128
transformers 4.56.2
Related Issues
This appears similar to:
- "docling-serve Inference takes forever when using pipeline"
- Difference: That issue is on docling-serve with the VLLM backend; this one is the direct docling CLI/API with the Transformers backend
Expected Behavior
The VLM pipeline should process the document and complete within a reasonable timeframe (similar to, or slightly longer than, the standard pipeline).
Actual Behavior
VLM pipeline hangs indefinitely with no progress, errors, or completion after the "Processing document..." log message.
Additional Context
- The standard pipeline completes successfully on the same hardware and documents
- GPU memory is allocated (~3-4 GB during the hanging state, per nvidia-smi)
- No error messages or exceptions are thrown
- Process must be killed with Ctrl+C or timeout
- Reproduced across multiple PDF documents (both simple and complex)
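A stack dump of the hung process would show exactly which call it is blocked in; py-spy (a third-party profiler, not part of Docling) can take one without restarting the process:
# Attach to the hung docling process and print its Python stack (replace <PID> with the actual process id)
pip install py-spy
py-spy dump --pid <PID>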
Workarounds Attempted
- ✗ Reduced max_new_tokens from 8192 to 1024 and 512 - still hangs
- ✗ Single-page PDF instead of multi-page - still hangs
- ✗ Explicit CUDA device specification - still hangs
- ✗ Different VLM models (tried SmolDocling; see the sketch after this list) - still hangs
- ✓ Using standard pipeline instead of VLM - works perfectly
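The SmolDocling attempt used the same converter wiring as the reduced-token sketch above, only swapping in a different spec (SMOLDOCLING_TRANSFORMERS is assumed to be the Transformers-backed SmolDocling spec in vlm_model_specs):
from docling.datamodel import vlm_model_specs
from docling.datamodel.pipeline_options import VlmPipelineOptions
# Same converter setup as above, with SmolDocling instead of Granite-Docling
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.SMOLDOCLING_TRANSFORMERS,
)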
Request
Please advise how I can resolve VLM generation getting stuck during inference on my RTX 3090 with the Transformers backend. Happy to provide additional logs, test cases, or debugging information as needed.
There are no documented issues or errors with the VLM pipeline hanging on NVIDIA GPUs with the Transformers backend, but recent changes to the inference code and generation argument handling could be related. The VLM pipeline uses the HuggingFaceTransformersVlmModel for inference, and recent updates filter generation and decoder arguments separately, which may affect how generation proceeds on CUDA devices. The model is loaded and compiled with torch.compile, and generation configuration is loaded with GenerationConfig.from_pretrained, including a new revision parameter. If the process hangs after "Processing document...", it's likely blocking on model.generate or decoding, possibly due to incompatible generation parameters or a silent error in the model call.
To debug further, try running the VLM pipeline with the CPU backend to see if the hang is specific to CUDA. Also, enable verbose logging if available, and isolate the model call in a minimal script to check if generation completes outside the pipeline. If the issue persists, consider testing with different generation parameters (like temperature, stop strings, or batch size) and reviewing the recent changes to generation argument handling for potential incompatibilities with your hardware or environment.
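For the CPU check, the reproduction command above can be reused with only the device flag changed:
# Same CLI reproduction, forced onto the CPU backend to see whether the hang is CUDA-specific
docling test_document.pdf --pipeline vlm --to md --output ./output_cpu --device cpu -v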
Recent PRs with changes to the VLM pipeline and Transformers backend may be relevant for further investigation: PR #2322.
I'm having the same issue, and I'm using an H100 GPU to run it, so HW/resources are definitely not the problem here.
I have the same issue; ran on a Google Colab T4 GPU and an M1 Pro.
As a workaround, I got it working via vLLM.
https://github.com/docling-project/docling/blob/main/docs%2Fusage%2Fgpu.md#L71
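For anyone else hitting this, the vLLM route looks roughly like the following (a sketch based on the linked GPU docs; it assumes vlm_model_specs ships a vLLM-backed Granite-Docling spec such as GRANITEDOCLING_VLLM - check the linked page for the exact name, and note that vllm must be installed in the environment):
from docling.datamodel import vlm_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline
# Select a vLLM-backed model spec instead of the default Transformers one
# (spec name assumed here; see docs/usage/gpu.md in the docling repo)
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_VLLM,
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)
result = converter.convert(source="test_document.pdf")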