llama.cpp
Low GPU usage of quantized Mixtral 8x22B for prompt processing on Metal
My computer is an M1 Max Mac Studio with a 32-core GPU and 64 GB of RAM, running macOS Sonoma 14.4.1.
I ran llama-bench from commit 4cc120c7443cf9dab898736f3c3b45dc8f14672b and it shows low GPU usage during prompt processing. Inference through main and server shows the same low GPU usage.
In the image above, I ran the benchmark for IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M, and Q2_K_S, but IQ1_S and IQ1_M from https://huggingface.co/MaziyarPanahi/Mixtral-8x22B-v0.1-GGUF show the same low GPU usage.
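For anyone trying to reproduce this, a minimal sketch of the benchmark and a way to watch Metal GPU utilization at the same time is below. The model path is an example placeholder, not the exact file from the report; the llama-bench flags (-m, -p, -n) and macOS powermetrics are standard, but adjust prompt/generation sizes to match your setup.

```shell
# Run the benchmark against a local quantized Mixtral GGUF
# (path is an example; substitute your own download).
# -p 512 measures prompt processing, -n 128 measures token generation.
./llama-bench -m models/Mixtral-8x22B-v0.1.IQ2_XS.gguf -p 512 -n 128

# In a second terminal, sample Metal GPU utilization once per second
# while the benchmark runs, to confirm the low usage during the pp phase.
sudo powermetrics --samplers gpu_power -i 1000
```

Comparing the GPU residency reported during the pp512 rows against the tg128 rows should make the prompt-processing underutilization visible.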
#6740
This issue was closed because it has been inactive for 14 days since being marked as stale.