
Low GPU usage of quantized Mixtral 8x22B for prompt processing on Metal

Open beebopkim opened this issue 10 months ago • 1 comment

My computer is an M1 Max Mac Studio with a 32-core GPU and 64 GB of RAM, running macOS Sonoma 14.4.1.

I ran llama-bench from commit 4cc120c7443cf9dab898736f3c3b45dc8f14672b, and it shows low GPU usage during prompt processing. As expected, inference with main and server shows the same low GPU usage.
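For reference, a minimal sketch of how such a run might look (the model path is a placeholder; Metal is enabled by default when building on macOS):

```sh
# Build llama-bench at the commit in question (Metal is on by default
# for macOS builds).
git checkout 4cc120c7443cf9dab898736f3c3b45dc8f14672b
make llama-bench

# Benchmark prompt processing (-p) and generation (-n), offloading all
# layers to the GPU (-ngl 99). The model path is a placeholder.
./llama-bench -m ./models/mixtral-8x22b-v0.1.IQ2_XXS.gguf -p 512 -n 128 -ngl 99
```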

[Screenshot 2024-04-13 at 12 40 32 AM: llama-bench results showing low GPU usage]

In the image above, I ran benchmarks for IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M, and Q2_K_S, but IQ1_S and IQ1_M from https://huggingface.co/MaziyarPanahi/Mixtral-8x22B-v0.1-GGUF show the same low GPU usage.
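For anyone reproducing this, the quants can be fetched with huggingface-cli; the include pattern below is a guess at the shard naming and may need adjusting against the repo's file listing:

```sh
# Download just the IQ1_S shards from the repo (the pattern is an
# assumption; check the repo for the exact filenames).
huggingface-cli download MaziyarPanahi/Mixtral-8x22B-v0.1-GGUF \
  --include "*IQ1_S*" --local-dir ./models
```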

beebopkim · Apr 12 '24 15:04

#6740

stefanvarunix · Apr 19 '24 09:04

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] · Jun 04 '24 01:06