
GPU Support Missing in Version >=0.3.5 on Windows with CUDA 12.4 and RTX 3090

Open · mcglynnfinn opened this issue 9 months ago · 2 comments

Issue Description:

I'm experiencing a discrepancy between version 0.3.4 and later versions (>=0.3.5) regarding GPU utilization:

Version 0.3.4 (Prebuilt Wheel): The prebuilt wheel for 0.3.4 loads the model onto the GPU; however, it's not compatible with phi4.

Version >=0.3.5: There are no prebuilt wheels available for these versions, and when building from source, only the CPU is being used—the model does not load onto the GPU.

System Details:

Operating System: Windows 11
CUDA Version: 12.4
GPU: RTX 3090 24GB

Steps Taken:

1. Installed version 0.3.4 via the prebuilt wheel – confirmed GPU loading (but the phi4 incompatibility remains).
2. Upgraded to version 0.3.5 (and above) by building from source with CUDA support enabled.
3. Verified that the build settings include -DGGML_CUDA=on and confirmed that the system has CUDA 12.4 installed.

Despite these configurations, the build defaults to CPU usage and the model never loads onto the GPU. Could you please advise whether this is expected behavior for versions >=0.3.5, or whether there might be an issue with GPU detection/configuration on Windows 11 with CUDA 12.4? Any guidance or troubleshooting steps to enable GPU support for these versions would be greatly appreciated. For reference, a sketch of the commands I ran is below.
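A minimal sketch of the build and verification steps described above (PowerShell syntax; the CMake flag is the one I set, and the model path is only an example):

```powershell
# Build llama-cpp-python from source with CUDA enabled (PowerShell syntax).
$env:CMAKE_ARGS = "-DGGML_CUDA=on"
pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir
```

And the quick check I use to see whether a build actually offloads to the GPU: with verbose=True, a CUDA-enabled build prints a line like "offloaded N/N layers to GPU" while the model loads, whereas a CPU-only build never does.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="C:/models/phi-4-Q4_K_M.gguf",  # example path, substitute your own GGUF file
    n_gpu_layers=-1,  # request that all layers be offloaded to the GPU
    verbose=True,     # prints the load log, including any GPU offload lines
)
print(llm("Hello", max_tokens=8))
```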

mcglynnfinn commented on Mar 09 '25

Maybe you can try my new prebuilt: https://github.com/JamePeng/llama-cpp-python/releases
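To install from that page, download the wheel matching your Python version and CUDA build and install it directly (the filename below is only a placeholder):

```powershell
pip install .\llama_cpp_python-<version>-cp312-cp312-win_amd64.whl
```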

JamePeng commented on Mar 09 '25

hey @mcglynnfinn, try installing the library using the command below:

```bash
CMAKE_ARGS="-DGGML_CUDA=ON -DLLAMA_LLAVA=OFF" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
```
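Note that the inline VAR=value prefix in that command is POSIX shell syntax (e.g. Git Bash or WSL); in Windows PowerShell the equivalent, using the same CMake flags, would be:

```powershell
# Set CMAKE_ARGS for the session, then rebuild/reinstall from source.
$env:CMAKE_ARGS = "-DGGML_CUDA=ON -DLLAMA_LLAVA=OFF"
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
```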

AleefBilal commented on Apr 18 '25