Docling-serve Fails to Utilize GPU Despite Available CUDA Provider
Environment:
- OS: Windows with Docker Desktop (WSL 2 backend)
- GPU: NVIDIA GeForce RTX 5070
- NVIDIA Driver: 581.57
- CUDA Version (reported by driver): 13.0
Goal:
Run docling-serve within Docker, utilizing the NVIDIA GPU for acceleration (e.g., for OCR, VLM).
Problem Description:
Attempts to run docling-serve with GPU support have encountered several issues leading to the GPU not being utilized, even when the container environment seems correctly configured.
Steps Taken & Issues Encountered:
- Image `ghcr.io/docling-project/docling-serve:latest`: Container runs, but uses CPU only, as expected.
- Image `ghcr.io/docling-project/docling-serve-cu126`: Container fails during job processing with `torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device`. This is likely due to incompatibility between the image's CUDA 12.6 build and the host's CUDA 13.0 capability (a quick way to check this is sketched after this list).
- Image `ghcr.io/docling-project/docling-serve-cu128`:
  - Container starts and `nvidia-smi` inside the container correctly identifies the GPU.
  - However, `nvidia-smi` shows `N/A` for GPU Memory Usage for the main python process (PID 1), even during active jobs.
  - Investigation revealed that this image incorrectly contains the `onnxruntime` (CPU) package instead of `onnxruntime-gpu`. The available providers list confirmed only `CPUExecutionProvider` was present.
- Custom Image Build (Workaround):
  - Created a `Dockerfile` based on `ghcr.io/docling-project/docling-serve-cu128`.
  - Added steps to `USER root`, `pip uninstall -y onnxruntime`, `pip install --no-cache-dir onnxruntime-gpu`, and `USER 1001`.
  - Built and ran this custom image using Docker Compose.
  - Verification inside the container (using `docker exec`):
    - `pip show onnxruntime onnxruntime-gpu` confirms `onnxruntime-gpu` is installed and `onnxruntime` is removed. ✅
    - `python -c "import onnxruntime as ort; print(ort.get_available_providers())"` outputs `['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']`, confirming the CUDA provider is now available. ✅
  - Persistent Issue: Despite the corrected setup and available CUDA provider, running `watch -n 1 nvidia-smi` inside the container during an active job still shows `N/A` GPU Memory Usage and 0% GPU-Util for the main python process (PID 1).
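For the `cu126` kernel error, a quick check (a sketch, assuming `torch` is importable inside the container) is to compare the GPU's compute capability against the architectures the installed torch build was compiled for:

```python
# Probe the installed torch build against the physical GPU.
import torch

print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))
print("compiled arch list:", torch.cuda.get_arch_list())
# If the device's capability (e.g. sm_120 on an RTX 5070) is not in the
# compiled arch list, torch has no kernels for this GPU and raises
# "no kernel image is available for execution on the device".
```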
Configuration Snippets:
`docker-compose.yml` (relevant service):

```yaml
docling:
  build:
    context: .
    dockerfile: Dockerfile
  container_name: docling
  restart: unless-stopped
  ports:
    - "5011:5001"
  environment:
    - DOCLING_SERVE_ENABLE_UI=true
    - DOCLING_SERVE_MAX_SYNC_WAIT=1200
    - DOCLING_SERVE_ARTIFACTS_PATH=/app/models
    - NVIDIA_VISIBLE_DEVICES=all
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]
  networks:
    - caddy_network
  volumes:
    - doclingdata:/app/data
    - doclingmodels:/app/models # Pre-populated using cp from base image cache
```

`Dockerfile`:

```dockerfile
FROM ghcr.io/docling-project/docling-serve-cu128

USER root
RUN pip uninstall -y onnxruntime && \
    pip install --no-cache-dir onnxruntime-gpu
USER 1001
```
Conclusion:
Even with a seemingly correct environment (GPU visible, correct onnxruntime-gpu installed, CUDA provider available), docling-serve appears to default to the CPU provider and does not utilize the GPU. This suggests an issue within Docling's logic for selecting the ONNX Runtime execution provider.
Hi @walterkru! I'm Dosu and I’m helping the docling team.
Your environment and custom image setup look correct, but docling-serve's actual use of the GPU depends on both the ONNX Runtime provider selection logic and the compatibility of all required CUDA extensions for the models being run.
In the code, provider selection for ONNX Runtime is dynamic: it tries to use GPU providers if available, but will fall back to CPU if the GPU provider is unavailable or incompatible. This logic is implemented in the VLM (Vision-Language Model) ONNX integration layer. If the CUDA provider is listed but not actually used during inference, it often means a required CUDA extension (like flash-attn for some Huggingface models) is missing or incompatible, which silently triggers a fallback to CPU—even though the provider appears available at the Python level.
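As a rough illustration (a sketch of the general pattern, not docling's actual code), provider selection with a silent CPU fallback looks something like this:

```python
# Minimal sketch of ONNX Runtime provider selection with CPU fallback.
# Function and variable names here are illustrative, not docling's.
import onnxruntime as ort

def make_session(model_path: str) -> ort.InferenceSession:
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    available = ort.get_available_providers()
    providers = [p for p in preferred if p in available] or ["CPUExecutionProvider"]
    sess = ort.InferenceSession(model_path, providers=providers)
    # get_providers() reports what the session actually bound to; if CUDA
    # initialization fails, onnxruntime logs a warning and falls back to CPU.
    print("requested:", providers, "-> active:", sess.get_providers())
    return sess
```

The key distinction is that `get_available_providers()` only reflects what the installed wheel was built with, while `session.get_providers()` after construction reflects what actually loaded at runtime.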
Recent issues and user reports confirm that missing or mismatched flash-attn builds are a common cause of this silent fallback, especially with newer GPUs and CUDA versions. The official docling-serve images have had problems with flash-attn compatibility, and users have worked around this by manually installing a prebuilt flash-attn wheel that matches their CUDA, torch, and Python versions inside the container. When this is done, GPU utilization works as expected for supported models (source, source).
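A quick way to confirm whether flash-attn is actually usable in the container (a sketch; assumes the standard `flash_attn` package name):

```python
# Hedged probe: a missing wheel and an ABI/CUDA mismatch both surface here
# as an import failure, which is what triggers the silent CPU fallback.
try:
    import flash_attn
    print("flash-attn", flash_attn.__version__, "imported OK")
except Exception as exc:
    print("flash-attn unavailable:", exc)
```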
Actionable steps:
- After ensuring onnxruntime-gpu is installed and the CUDA provider is available, check if your models require flash-attn or other CUDA extensions. If so, install a compatible flash-attn wheel inside your container. Prebuilt wheels are available for many CUDA versions, or you may need to build from source for the latest hardware.
- Monitor logs for any warnings or errors about CUDA extension loading or provider fallback. If possible, add debug logging to the ONNX session creation code to confirm which provider is actually used at runtime.
- If you still see no GPU utilization, try running a minimal ONNX inference script inside the container to isolate whether the issue is with the environment or with docling-serve's model/provider logic.
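For the last point in the list above, here is a minimal, self-contained probe (a sketch, assuming `torch`, `numpy`, and `onnxruntime-gpu` are installed in the container) that exports a tiny model and reports which provider the ONNX Runtime session actually binds to:

```python
# Export a trivial model to ONNX and run it, printing the active providers.
import numpy as np
import torch
import onnxruntime as ort

model = torch.nn.Linear(8, 8)
torch.onnx.export(model, torch.randn(1, 8), "/tmp/probe.onnx")

sess = ort.InferenceSession(
    "/tmp/probe.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print("active providers:", sess.get_providers())  # expect CUDA listed first

out = sess.run(None, {sess.get_inputs()[0].name: np.random.randn(1, 8).astype(np.float32)})
print("output shape:", out[0].shape)
```

If `CUDAExecutionProvider` is missing from the active providers here, the problem is in the container environment rather than in docling-serve's model/provider logic.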
This is a common pain point with the current ecosystem, especially for new GPUs and CUDA versions, and is not unique to docling-serve. Let me know if you need pointers for installing flash-attn or for adding debug logging to confirm provider usage.
To reply, just mention @dosu.
TL;DR: You need to install CUDA toolkit 12.8 for the `docling-serve-cu128` image to be able to enable CUDA acceleration.
`nvidia-smi` does not tell you which version of the CUDA toolkit is installed; it reports the NVIDIA driver version and the highest CUDA version that driver supports. Run `nvcc --version` to check which CUDA toolkit version is installed in your container.
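From inside the container you can also check what the Python stacks themselves were built against (a sketch, assuming `torch` and `onnxruntime` are importable):

```python
# Print the CUDA version torch was built against and whether the installed
# onnxruntime wheel is the GPU build.
import torch
import onnxruntime as ort

print("torch built for CUDA:", torch.version.cuda)
print("onnxruntime:", ort.__version__, "device:", ort.get_device())  # "GPU" for onnxruntime-gpu
```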
Now, the issue here seems to be related to the supported CUDA toolkit version. According to the official documentation, the `docling-serve-cu128` image is built against CUDA toolkit 12.8 - https://github.com/docling-project/docling-serve/pkgs/container/docling-serve
I also faced the same issue: my CUDA toolkit version was 13.0 and docling was not detecting CUDA, thus defaulting to CPU acceleration. Uninstalling toolkit 13.0 and installing toolkit 12.8 resolved it for me.
I cannot run `nvidia-smi` in the container and am unable to utilise the GPU on K8s. Is there a guide I can refer to?
I've got the same issue, need help!!!