
ggml_cuda_init: failed to initialize CUDA: (null) on Windows with CUDA 12.9


System Information:

  • OS: Windows
  • GPU: NVIDIA GeForce RTX 5060 Ti
  • NVIDIA Driver Version: 577.00
  • CUDA Version (from nvidia-smi): 12.9
  • Python Version: 3.12
  • Visual Studio: Visual Studio 2019 with "Desktop development with C++" workload

Problem Description: I am unable to get llama-cpp-python to use my GPU. When I run a script to load a model with `n_gpu_layers=-1`, I get the error `ggml_cuda_init: failed to initialize CUDA: (null)`, and all layers are loaded on the CPU.
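
One check that may help narrow this down is asking the installed build whether it was compiled with GPU offload at all, which separates a CPU-only wheel from a CUDA build that fails at runtime. A minimal sketch, assuming this version of the bindings exposes the low-level `llama_supports_gpu_offload` function:

```python
# Diagnostic sketch: distinguishes a CPU-only build from a CUDA build
# that fails to initialize at runtime. Assumes the installed llama_cpp
# exposes the low-level llama_supports_gpu_offload binding.
import llama_cpp

print(llama_cpp.llama_supports_gpu_offload())
# False -> the wheel was built without CUDA (a build/install problem)
# True  -> the build includes CUDA, so initialization fails at runtime
```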

Troubleshooting Steps Taken:

  1. Installed llama-cpp-python using the following command in the "x64 Native Tools Command Prompt for VS 2019" with a Python virtual environment activated (see the note on cmd quoting after the script below): `set CMAKE_ARGS="-DGGML_CUDA=on" && pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir`
  2. Verified that the command completes successfully, but the resulting installation does not use the GPU.
  3. Tried using the deprecated `LLAMA_CUBLAS` flag, which resulted in a build error (as expected).
  4. Performed a full cleanup of the environment:
    • pip uninstall llama-cpp-python
    • pip cache purge
    • Manually deleted leftover `~*` directories from site-packages.
  5. Reinstalled after the cleanup, but the problem persists.
  6. Installed PyTorch with CUDA 12.1 support (`pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121`) before reinstalling llama-cpp-python, but this did not resolve the issue.
  7. Confirmed that the correct Python interpreter and virtual environment are being used.
  8. The run_with_llama_cpp.py script being used is:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=4096,
    verbose=True
)

output = llm(
    "AI is going to ",
    max_tokens=32,
    stop=["."],
    echo=True
)

print(output)
```
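
One thing worth double-checking about step 1: in cmd.exe, everything after the `=` in a `set` statement is taken literally, so `set CMAKE_ARGS="-DGGML_CUDA=on"` stores the surrounding quotes as part of the value, which CMake may then misparse and silently fall back to a CPU-only build. A minimal sketch to inspect the variable from the same shell (the commented outputs are what I would expect for each form):

```python
# Run from the same cmd.exe session used for the install.
# In cmd, `set CMAKE_ARGS="-DGGML_CUDA=on"` keeps the quotes in the value;
# `set CMAKE_ARGS=-DGGML_CUDA=on` (no quotes) does not.
import os

print(repr(os.environ.get("CMAKE_ARGS")))
# Quoted form   -> '"-DGGML_CUDA=on"'  (quotes included)
# Unquoted form -> '-DGGML_CUDA=on'
```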

Request: Could you please provide any insights into why the CUDA initialization might be failing, or suggest any further diagnostic steps? I can provide the full verbose build log if needed.
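
As an additional data point, a driver-level check from the same venv (assuming the cu121 PyTorch from step 6 is still installed) would show whether any library can initialize CUDA there, separating a driver/environment problem from a llama-cpp-python build problem:

```python
# Driver-level sanity check; assumes the cu121 PyTorch from step 6 is
# still installed in this venv.
import torch

if torch.cuda.is_available():
    print("CUDA initialized:", torch.cuda.get_device_name(0))  # expect the RTX 5060 Ti
else:
    print("PyTorch cannot initialize CUDA either -> likely a driver/toolkit issue")
```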
