llama-cpp-python
CUDA wheel installs, but GPU is never used on Windows 11 (Python 3.11, CUDA 12.1, torch finds GPU)
On Windows 11, with Python 3.11 and a CUDA 12.1-compatible NVIDIA GPU, I can successfully install llama-cpp-python via pip from the cu121 wheel, but no matter what, all model layers are always assigned to CPU only.
torch works and finds the GPU (torch.cuda.is_available() == True).
All paths and DLLs appear correct.
No errors, just a silent fallback to CPU for every layer, even with n_gpu_layers set (tried both -1 and 32).
I have confirmed this across multiple clean venvs and installs.
System Info
OS: Windows 11 Pro
GPU: NVIDIA RTX 4060 Ti, 16 GB VRAM
NVIDIA Driver Version: 581.57 (reported by nvidia-smi)
CUDA Toolkit Version: 12.1 (from PATH and NVIDIA tools)
Python Version: 3.11.x
llama-cpp-python Version: 0.3.16 (from pip show llama-cpp-python)
Installation Command:
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
Model Used: Mistral 7B Q5_K GGUF (4.7GB)
What I Tried
Fresh venv (python -m venv envgpu311)
Confirmed correct Python version: where python → E:\envgpu311\Scripts\python.exe
pip list shows torch 2.5.1+cu121, llama-cpp-python 0.3.16
System PATH contains correct CUDA 12.1 bin directory
cudart64_121.dll is present in CUDA bin
Test script:
from llama_cpp import Llama

llm = Llama(
    model_path="E:/Python/Models/Mistral/lucian_model.gguf",
    n_ctx=512,
    n_gpu_layers=32,
    verbose=True,
)
print(llm("Hello, world!"))
Logs never mention CUDA or the GPU; every layer is reported as assigned to device CPU (see the offload-support check after this list).
torch tests pass:
import torch
print(torch.__version__)          # 2.5.1+cu121
print(torch.cuda.is_available())  # True
Tried uninstalling all other venvs, cleaning PATH, and purging pip cache.
Also tried n_gpu_layers=-1 and 32 as stated above
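One more check worth running here: torch.cuda.is_available() only reflects PyTorch's own bundled CUDA runtime and says nothing about how llama-cpp-python's native library was compiled. A rough sketch of a build-level check (assuming a recent version where the low-level binding llama_supports_gpu_offload() is re-exported at the package top level, and where the bundled binaries live under llama_cpp/lib; both may differ between builds):

import pathlib
import llama_cpp

# False means the installed binary has no GPU backend at all,
# regardless of drivers, PATH, or n_gpu_layers.
print(llama_cpp.llama_supports_gpu_offload())

# List the native libraries shipped with the package; a CUDA build
# typically carries a ggml CUDA DLL next to llama.dll.
lib_dir = pathlib.Path(llama_cpp.__file__).parent / "lib"
for f in sorted(lib_dir.glob("*")):
    print(f.name)

If the first print shows False, the problem is the wheel/build itself rather than the runtime environment.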
What Happens
Model loads, runs, and produces output, but all layers go to CPU: no GPU assignment, no CUDA logs, and performance matches CPU-only inference.
What I Expected
Layers assigned to GPU, CUDA logs in output, and at least some portion of the model running on GPU
Other Details
No error messages, just silent fallback to CPU.
Issue persists across all wheels and source builds tested (a forced CUDA source build is sketched after this list).
pip show llama-cpp-python:
Name: llama_cpp_python
Version: 0.3.16
Location: E:\envgpu311\Lib\site-packages
Requires: diskcache, jinja2, numpy, typing-extensions
torch and other CUDA tools/apps detect GPU fine.
Latest NVIDIA driver installed 10/15/2025
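For reference, the usual way to force a CUDA-enabled source build on Windows is the CMake flag from the project README; a sketch (PowerShell, assuming the CUDA Toolkit and Visual Studio build tools are installed and on PATH):

$env:CMAKE_ARGS = "-DGGML_CUDA=on"
pip install llama-cpp-python --force-reinstall --no-cache-dir --verbose

If the environment variable is not picked up, pip will silently build the CPU-only version again, so it is worth watching the --verbose output for CUDA/nvcc lines.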
Use a fork that provides prebuilt Windows wheels, such as JamePeng/llama-cpp-python
pip install https://github.com/JamePeng/llama-cpp-python/releases/download/v0.3.16-cu128-AVX2-win-20251007/llama_cpp_python-0.3.16-cp311-cp311-win_amd64.whl
The original llama-cpp-python repository does not provide Windows wheels.
You can confirm this on the releases page (abetlen/llama-cpp-python/releases) or on the wheels index page; the file names include the Python version and OS.
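Two quick ways to double-check what pip will actually install (a sketch; the packaging module is usually already present, otherwise pip install packaging):

# Print the most-preferred wheel tag for this interpreter; it must match
# the tags in the wheel file name (expected here: cp311-cp311-win_amd64).
python -c "from packaging.tags import sys_tags; print(next(iter(sys_tags())))"

# Refuse source builds, so a missing Windows wheel fails loudly instead of
# silently compiling a CPU-only package:
pip install llama-cpp-python --only-binary=:all: --force-reinstall --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121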
To verify the installation:
python -c "import llama_cpp; print(llama_cpp.llama_print_system_info())"
# Example output for CPU support:
# CPU : SSE3 = 1 | SSSE3 = 1 | ...
# Example output for CUDA support:
# ggml_cuda_init: found 1 CUDA devices: ...
# CUDA : ARCHS = 700,750,800,860,890,900 | FORCE_MMQ = 1 ...
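On a CUDA-enabled build, loading a model with verbose=True should also report layer offload rather than CPU assignment; roughly (the exact prefix and wording vary between llama.cpp versions):

# llm_load_tensors: offloaded 33/33 layers to GPU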
It's actually:
pip install https://github.com/JamePeng/llama-cpp-python/releases/download/v0.3.16-cu128-AVX2-win-20251023/llama_cpp_python-0.3.16-cp311-cp311-win_amd64.whl
It also seems to wrongly state (see below) that pre-built wheels exist for CUDA 12.5, when that index actually returns a 404: https://abetlen.github.io/llama-cpp-python/whl/cu125/llama-cpp-python/
Latest pre-built wheel available only for CUDA 12.4: https://abetlen.github.io/llama-cpp-python/whl/cu124/llama-cpp-python/
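For the original setup here, the equivalent install against that cu124 index would be (a sketch; check the index first for a wheel matching your Python version and platform):

pip install llama-cpp-python --force-reinstall --no-cache-dir --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124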