Docling-serve Fails to Utilize GPU Despite Available CUDA Provider
Environment:
- OS: Windows with Docker Desktop (WSL 2 backend)
- GPU: NVIDIA GeForce RTX 5070
- NVIDIA Driver: 581.57
- CUDA Version (reported by driver): 13.0
Goal:
Run docling-serve within Docker, utilizing the NVIDIA GPU for acceleration (e.g., for OCR, VLM).
Problem Description:
Attempts to run docling-serve with GPU support have encountered several issues leading to the GPU not being utilized, even when the container environment seems correctly configured.
Steps Taken & Issues Encountered:
- Image `ghcr.io/docling-project/docling-serve:latest`: Container runs, but uses CPU only, as expected.
- Image `ghcr.io/docling-project/docling-serve-cu126`: Container fails during job processing with `torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device`. This is likely due to incompatibility between the image's CUDA 12.6 build and the host's CUDA 13.0 capability (a quick way to check this is sketched after this list).
- Image `ghcr.io/docling-project/docling-serve-cu128`:
  - Container starts and `nvidia-smi` inside the container correctly identifies the GPU.
  - However, `nvidia-smi` shows `N/A` for GPU Memory Usage for the main python process (PID 1), even during active jobs.
  - Investigation revealed that this image incorrectly contains the `onnxruntime` (CPU) package instead of `onnxruntime-gpu`. The available providers list confirmed only `CPUExecutionProvider` was present.
- Custom Image Build (Workaround):
  - Created a `Dockerfile` based on `ghcr.io/docling-project/docling-serve-cu128`.
  - Added steps to `USER root`, `pip uninstall -y onnxruntime`, `pip install --no-cache-dir onnxruntime-gpu`, and `USER 1001`.
  - Built and ran this custom image using Docker Compose.
  - Verification inside the container (using `docker exec`):
    - `pip show onnxruntime onnxruntime-gpu` confirms `onnxruntime-gpu` is installed and `onnxruntime` is removed. ✅
    - `python -c "import onnxruntime as ort; print(ort.get_available_providers())"` outputs `['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']`, confirming the CUDA provider is now available. ✅
  - Persistent Issue: Despite the corrected setup and available CUDA provider, running `watch -n 1 nvidia-smi` inside the container during an active job still shows `N/A` GPU Memory Usage and 0% GPU-Util for the main python process (PID 1).
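For the `cu126` kernel error, a quick check (a sketch, assuming `torch` is importable inside the container) is to compare the GPU's compute capability against the architectures the installed torch build was compiled for:

```python
# Probe the installed torch build against the physical GPU.
import torch

print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))
print("compiled arch list:", torch.cuda.get_arch_list())
# If the device's capability (e.g. sm_120 on an RTX 5070) is not in the
# compiled arch list, torch has no kernels for this GPU and raises
# "no kernel image is available for execution on the device".
```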
Configuration Snippets:
`docker-compose.yml` (relevant service):

```yaml
docling:
  build:
    context: .
    dockerfile: Dockerfile
  container_name: docling
  restart: unless-stopped
  ports:
    - "5011:5001"
  environment:
    - DOCLING_SERVE_ENABLE_UI=true
    - DOCLING_SERVE_MAX_SYNC_WAIT=1200
    - DOCLING_SERVE_ARTIFACTS_PATH=/app/models
    - NVIDIA_VISIBLE_DEVICES=all
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]
  networks:
    - caddy_network
  volumes:
    - doclingdata:/app/data
    - doclingmodels:/app/models # Pre-populated using cp from base image cache
```

`Dockerfile`:

```dockerfile
FROM ghcr.io/docling-project/docling-serve-cu128

USER root
RUN pip uninstall -y onnxruntime && \
    pip install --no-cache-dir onnxruntime-gpu
USER 1001
```
Conclusion:
Even with a seemingly correct environment (GPU visible, correct onnxruntime-gpu installed, CUDA provider available), docling-serve appears to default to the CPU provider and does not utilize the GPU. This suggests an issue within Docling's logic for selecting the ONNX Runtime execution provider.
Hi @walterkru! I'm Dosu and I’m helping the docling team.
Your environment and custom image setup look correct, but docling-serve's actual use of the GPU depends on both the ONNX Runtime provider selection logic and the compatibility of all required CUDA extensions for the models being run.
In the code, provider selection for ONNX Runtime is dynamic: it tries to use GPU providers if available, but will fall back to CPU if the GPU provider is unavailable or incompatible. This logic is implemented in the VLM (Vision-Language Model) ONNX integration layer. If the CUDA provider is listed but not actually used during inference, it often means a required CUDA extension (like flash-attn for some Huggingface models) is missing or incompatible, which silently triggers a fallback to CPU—even though the provider appears available at the Python level.
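As a rough illustration (a sketch of the general pattern, not docling's actual code), provider selection with a silent CPU fallback looks something like this:

```python
# Minimal sketch of ONNX Runtime provider selection with CPU fallback.
# Function and variable names here are illustrative, not docling's.
import onnxruntime as ort

def make_session(model_path: str) -> ort.InferenceSession:
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    available = ort.get_available_providers()
    providers = [p for p in preferred if p in available] or ["CPUExecutionProvider"]
    sess = ort.InferenceSession(model_path, providers=providers)
    # get_providers() reports what the session actually bound to; if CUDA
    # initialization fails, onnxruntime logs a warning and falls back to CPU.
    print("requested:", providers, "-> active:", sess.get_providers())
    return sess
```

The key distinction is that `get_available_providers()` only reflects what the installed wheel was built with, while `session.get_providers()` after construction reflects what actually loaded at runtime.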
Recent issues and user reports confirm that missing or mismatched flash-attn builds are a common cause of this silent fallback, especially with newer GPUs and CUDA versions. The official docling-serve images have had problems with flash-attn compatibility, and users have worked around this by manually installing a prebuilt flash-attn wheel that matches their CUDA, torch, and Python versions inside the container. When this is done, GPU utilization works as expected for supported models (source, source).
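A quick way to confirm whether flash-attn is actually usable in the container (a sketch; assumes the standard `flash_attn` package name):

```python
# Hedged probe: a missing wheel and an ABI/CUDA mismatch both surface here
# as an import failure, which is what triggers the silent CPU fallback.
try:
    import flash_attn
    print("flash-attn", flash_attn.__version__, "imported OK")
except Exception as exc:
    print("flash-attn unavailable:", exc)
```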
Actionable steps:
- After ensuring onnxruntime-gpu is installed and the CUDA provider is available, check if your models require flash-attn or other CUDA extensions. If so, install a compatible flash-attn wheel inside your container. Prebuilt wheels are available for many CUDA versions, or you may need to build from source for the latest hardware.
- Monitor logs for any warnings or errors about CUDA extension loading or provider fallback. If possible, add debug logging to the ONNX session creation code to confirm which provider is actually used at runtime.
- If you still see no GPU utilization, try running a minimal ONNX inference script inside the container to isolate whether the issue is with the environment or with docling-serve's model/provider logic.
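For the last point in the list above, here is a minimal, self-contained probe (a sketch, assuming `torch`, `numpy`, and `onnxruntime-gpu` are installed in the container) that exports a tiny model and reports which provider the ONNX Runtime session actually binds to:

```python
# Export a trivial model to ONNX and run it, printing the active providers.
import numpy as np
import torch
import onnxruntime as ort

model = torch.nn.Linear(8, 8)
torch.onnx.export(model, torch.randn(1, 8), "/tmp/probe.onnx")

sess = ort.InferenceSession(
    "/tmp/probe.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print("active providers:", sess.get_providers())  # expect CUDA listed first

out = sess.run(None, {sess.get_inputs()[0].name: np.random.randn(1, 8).astype(np.float32)})
print("output shape:", out[0].shape)
```

If `CUDAExecutionProvider` is missing from the active providers here, the problem is in the container environment rather than in docling-serve's model/provider logic.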
This is a common pain point with the current ecosystem, especially for new GPUs and CUDA versions, and is not unique to docling-serve. Let me know if you need pointers for installing flash-attn or for adding debug logging to confirm provider usage.
To reply, just mention @dosu.
TL;DR: You need to install CUDA toolkit 12.8 for the `docling-serve-cu128` image to be able to enable CUDA acceleration.
`nvidia-smi` does not tell you which version of the CUDA toolkit is installed; it reports the NVIDIA driver version and the highest CUDA version that driver supports. Run `nvcc --version` to check which CUDA toolkit version is installed in your container.
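From inside the container you can also check what the Python stacks themselves were built against (a sketch, assuming `torch` and `onnxruntime` are importable):

```python
# Print the CUDA version torch was built against and whether the installed
# onnxruntime wheel is the GPU build.
import torch
import onnxruntime as ort

print("torch built for CUDA:", torch.version.cuda)
print("onnxruntime:", ort.__version__, "device:", ort.get_device())  # "GPU" for onnxruntime-gpu
```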
Now, the issue here seems to be related to the supported CUDA toolkit version. According to the official documentation, the `docling-serve-cu128` image is built against CUDA toolkit 12.8 - https://github.com/docling-project/docling-serve/pkgs/container/docling-serve
I also faced the same issue: my CUDA toolkit version was 13.0 and docling was not detecting CUDA, thus defaulting to CPU acceleration. Uninstalling toolkit 13.0 and installing toolkit 12.8 resolved it for me.
I cannot run `nvidia-smi` in the container and am unable to utilise the GPU on K8s. Is there a guide I can refer to?
I've got the same issue, need help!!!