
Models loading on CPU instead of GPU after updating version

Open KuraiAI opened this issue 10 months ago • 4 comments

I updated my version because some DeepSeek models were failing to load; after updating they loaded, but only on CPU. Other, older models on my system that used to load on GPU now load only onto CPU as well. I noticed this line in particular, which others have mentioned for the same issue: tensor 'token_embd.weight' (q4_K) (and 322 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
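A quick way to confirm this is happening on your system is to scan the loader's verbose output for that warning. Below is a hypothetical helper (not part of llama-cpp-python; the function name is my own) that checks captured log text for the fallback message quoted above:

```python
def fell_back_to_cpu(log_text: str) -> bool:
    """Return True if the llama.cpp load log shows tensors falling back to CPU.

    Matches the warning quoted in this issue, e.g.:
    "... cannot be used with preferred buffer type CPU_AARCH64, using CPU instead"
    """
    return ("cannot be used with preferred buffer type" in log_text
            and "using CPU instead" in log_text)


# Example: the exact line reported in this issue triggers a match.
log = ("tensor 'token_embd.weight' (q4_K) (and 322 others) cannot be used with "
       "preferred buffer type CPU_AARCH64, using CPU instead")
print(fell_back_to_cpu(log))  # → True
```

Run your model load with verbose=True and pipe stderr through a check like this to tell the CPU-fallback case apart from an ordinary CPU-only build.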

I downgraded to version 0.3.6 and it loads onto my GPU now.
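For anyone wanting to apply the same workaround, a pin to 0.3.6 might look like this. The CMAKE_ARGS line is an assumption on my part: it is the usual way to request a CUDA build when compiling from source, and is only needed if you are not installing a prebuilt GPU wheel.

```shell
# Workaround from this thread: roll back to the last version that still
# offloaded to GPU on this setup.
# CMAKE_ARGS is only needed for a source build; skip it if you use a prebuilt wheel.
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python==0.3.6 \
  --force-reinstall --no-cache-dir
```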

I can just use the older version, but it would be nice if this gets fixed so that those of us with this issue aren't locked out of newer versions.

KuraiAI avatar Mar 05 '25 20:03 KuraiAI

I'm having the exact same issue, but only if I go above 0.3.4.

Which CUDA version are you using, and are you using a pre-made wheel (e.g. from https://abetlen.github.io/llama-cpp-python/whl/)?
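To rule out having accidentally installed a CPU-only wheel, you can point pip at the prebuilt-wheel index linked above. The cu124 suffix here is an assumption for illustration; match it to the CUDA version actually installed on your machine.

```shell
# Install a prebuilt CUDA wheel instead of the default (CPU-only) PyPI wheel.
# Replace cu124 with your CUDA version (e.g. cu121, cu122, cu123).
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
```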

mcglynnfinn avatar Mar 09 '25 09:03 mcglynnfinn

Same issue in Docker version... womp.

PeterTucker avatar Mar 12 '25 18:03 PeterTucker

I have a similar issue, and the model I use is not supported in v0.3.6.

Willian7004 avatar Mar 14 '25 09:03 Willian7004

I'm running into the same issue.

keepkeen avatar Apr 09 '25 06:04 keepkeen