VLM pipeline hangs indefinitely during document processing with Transformers backend on NVIDIA GPU
Bug
The VLM pipeline with Granite-Docling hangs indefinitely during document processing on NVIDIA RTX 3090 GPU. The process gets stuck at the "Processing document..." stage and never completes, even after 30+ minutes. This occurs with both multi-page and single-page PDFs.
The standard pipeline works perfectly fine on the same documents, completing in ~126 seconds for a 10-page PDF.
Symptoms:
- Process hangs after the log message "INFO - Processing document [filename].pdf"; no further progress, no errors, no completion
- GPU shows activity (model is loaded on CUDA) but generation appears stuck
- Reproducible 100% of the time with VLM pipeline
- Standard pipeline works without issues
Last log output before hanging:
2025-10-15 17:05:42,177 - INFO - Going to convert document batch...
2025-10-15 17:05:42,178 - INFO - Initializing pipeline for VlmPipeline with options hash 14b35a24912cc09d5c7735b8ff9d88c1
2025-10-15 17:05:42,351 - INFO - Loading plugin 'docling_defaults'
2025-10-15 17:05:42,354 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-10-15 17:05:42,564 - INFO - Accelerator device: 'cuda:0'
2025-10-15 17:06:25,701 - INFO - Processing document 3HANS-1.pdf
[HANGS HERE - no further output]
Steps to reproduce
Test 1: Using CLI (simplest reproduction)
# This hangs indefinitely
docling test_document.pdf --pipeline vlm --to md --output ./output_vlm --device cuda -v
# This works fine (completes in ~126s for 10-page PDF)
docling test_document.pdf --to md --output ./output_standard --device cuda -v
Test 2: Using Python API
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline
converter = DocumentConverter(
format_options={
InputFormat.PDF: PdfFormatOption(
pipeline_cls=VlmPipeline,
),
}
)
# Hangs at convert() call
result = converter.convert(source="test_document.pdf")
Test 3: Even single-page PDFs hang
# Created 1-page test PDF - still hangs
docling single_page.pdf --pipeline vlm --to md --output ./test --device cuda -v
Test 4: Attempted workaround with reduced tokens
from docling.datamodel import vlm_model_specs
import copy
# Modified max_new_tokens from default 8192 to 1024
custom_spec = copy.deepcopy(vlm_model_specs.GRANITEDOCLING_TRANSFORMERS)
custom_spec.max_new_tokens = 1024
# Still hangs
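For reference, the reduced-token spec was wired into the converter along these lines (a sketch; it assumes VlmPipelineOptions accepts the custom spec through its vlm_options field, as in the Docling VLM pipeline examples):
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline
# custom_spec is the reduced-token copy of GRANITEDOCLING_TRANSFORMERS created above
pipeline_options = VlmPipelineOptions(vlm_options=custom_spec)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)
result = converter.convert(source="test_document.pdf")  # still hangs at this call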
Environment Details
Docling version:
Docling version: 2.56.1
Docling Core version: 2.48.4
Docling IBM Models version: 3.9.1
Docling Parse version: 4.5.0
Python: cpython-312 (3.12.9)
Platform: Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python version:
Python 3.12.9
System Information:
- OS: Linux (WSL2)
- Kernel: 6.6.87.2-microsoft-standard-WSL2
- GPU: NVIDIA GeForce RTX 3090 (24GB VRAM)
- CUDA Version: 12.8
- PyTorch: 2.8.0+cu128
- CUDA Available: True
Verification that GPU is working:
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.get_device_name(0)
'NVIDIA GeForce RTX 3090'
>>> torch.__version__
'2.8.0+cu128'
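A slightly stronger check than device visibility is to launch a small kernel and synchronize on it, confirming that CUDA work actually runs to completion under WSL2 (not part of the original verification, just plain PyTorch):
>>> a = torch.randn(2048, 2048, device="cuda")
>>> (a @ a).sum().item()       # forces a matmul kernel and copies the result back to the host
>>> torch.cuda.synchronize()   # returns only once all queued CUDA work has finished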
Relevant package versions:
torch 2.8.0+cu128
transformers 4.56.2
Related Issues
This appears similar to:
- "docling-serve Inference takes forever when using pipeline"
- Difference: That issue is on docling-serve with the VLLM backend; this one is the direct docling CLI/API with the Transformers backend
Expected Behavior
The VLM pipeline should process the document and complete within a reasonable timeframe (similar to, or slightly longer than, the standard pipeline).
Actual Behavior
VLM pipeline hangs indefinitely with no progress, errors, or completion after the "Processing document..." log message.
Additional Context
- The standard pipeline completes successfully on the same hardware and documents
- GPU memory is allocated (~3-4 GB during the hanging state, per nvidia-smi)
- No error messages or exceptions are thrown
- Process must be killed with Ctrl+C or timeout
- Reproduced across multiple PDF documents (both simple and complex)
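A stack dump of the hung process would show exactly which call it is blocked in; py-spy (a third-party profiler, not part of Docling) can take one without restarting the process:
# Attach to the hung docling process and print its Python stack (replace <PID> with the actual process id)
pip install py-spy
py-spy dump --pid <PID>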
Workarounds Attempted
- ✗ Reduced max_new_tokens from 8192 to 1024 and 512 - still hangs
- ✗ Single-page PDF instead of multi-page - still hangs
- ✗ Explicit CUDA device specification - still hangs
- ✗ Different VLM models (tried SmolDocling; see the sketch after this list) - still hangs
- ✓ Using standard pipeline instead of VLM - works perfectly
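The SmolDocling attempt used the same converter wiring as the reduced-token sketch above, only swapping in a different spec (SMOLDOCLING_TRANSFORMERS is assumed to be the Transformers-backed SmolDocling spec in vlm_model_specs):
from docling.datamodel import vlm_model_specs
from docling.datamodel.pipeline_options import VlmPipelineOptions
# Same converter setup as above, with SmolDocling instead of Granite-Docling
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.SMOLDOCLING_TRANSFORMERS,
)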
Request
Please advise how I can resolve VLM generation getting stuck during inference on my RTX 3090 with the Transformers backend. Happy to provide additional logs, test cases, or debugging information as needed.
There are no documented issues or errors with the VLM pipeline hanging on NVIDIA GPUs with the Transformers backend, but recent changes to the inference code and generation argument handling could be related. The VLM pipeline uses the HuggingFaceTransformersVlmModel for inference, and recent updates filter generation and decoder arguments separately, which may affect how generation proceeds on CUDA devices. The model is loaded and compiled with torch.compile, and generation configuration is loaded with GenerationConfig.from_pretrained, including a new revision parameter. If the process hangs after "Processing document...", it's likely blocking on model.generate or decoding, possibly due to incompatible generation parameters or a silent error in the model call.
To debug further, try running the VLM pipeline with the CPU backend to see if the hang is specific to CUDA. Also, enable verbose logging if available, and isolate the model call in a minimal script to check if generation completes outside the pipeline. If the issue persists, consider testing with different generation parameters (like temperature, stop strings, or batch size) and reviewing the recent changes to generation argument handling for potential incompatibilities with your hardware or environment.
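For the CPU check, the reproduction command above can be reused with only the device flag changed:
# Same CLI reproduction, forced onto the CPU backend to see whether the hang is CUDA-specific
docling test_document.pdf --pipeline vlm --to md --output ./output_cpu --device cpu -v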
Recent PRs with changes to the VLM pipeline and Transformers backend may be relevant for further investigation: PR #2322.
I'm having the same issue, and I'm using an H100 GPU to run it, so HW/resources are definitely not the problem here.
I have the same issue; ran on a Google Colab T4 GPU and an M1 Pro.
As a workaround, I got it working via vLLM.
https://github.com/docling-project/docling/blob/main/docs%2Fusage%2Fgpu.md#L71
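For anyone else hitting this, the vLLM route looks roughly like the following (a sketch based on the linked GPU docs; it assumes vlm_model_specs ships a vLLM-backed Granite-Docling spec such as GRANITEDOCLING_VLLM - check the linked page for the exact name, and note that vllm must be installed in the environment):
from docling.datamodel import vlm_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline
# Select a vLLM-backed model spec instead of the default Transformers one
# (spec name assumed here; see docs/usage/gpu.md in the docling repo)
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_VLLM,
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)
result = converter.convert(source="test_document.pdf")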