llama.cpp
Bug: IQ3_M is significantly slower than IQ4_XS on AMD, is it expected?
What happened?
Model: https://huggingface.co/bartowski/gemma-2-27b-it-GGUF
AMD GPU: RX 7600 XT + RX 7600 (full offload)
With IQ3_M I get about 10 t/s, while IQ4_XS gets nearly 15 t/s. I thought smaller models would run faster because they need less memory bandwidth, and both are IQ quants.
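To make the bandwidth expectation concrete, here is a rough back-of-the-envelope sketch. It assumes token generation is purely memory-bandwidth-bound, so every weight is read once per generated token. The parameter count (27e9), the bits-per-weight figures (3.66 for IQ3_M, 4.25 for IQ4_XS), and the 288 GB/s bandwidth are illustrative assumptions, not measurements from this setup. Only the 10 t/s and 15 t/s figures come from the report above.

```python
def bandwidth_bound_tps(n_params: float, bits_per_weight: float,
                        bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if each weight is read exactly once per token."""
    model_bytes = n_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes


# All three values below are assumptions for illustration only.
N_PARAMS = 27e9                          # approximate parameter count
BANDWIDTH_GB_S = 288.0                   # assumed per-GPU memory bandwidth
BPW = {"IQ3_M": 3.66, "IQ4_XS": 4.25}    # assumed average bits per weight

bounds = {name: bandwidth_bound_tps(N_PARAMS, bpw, BANDWIDTH_GB_S)
          for name, bpw in BPW.items()}

for name, tps in bounds.items():
    print(f"{name}: at most {tps:.1f} t/s if bandwidth-bound")

# Ratio predicted by the bandwidth model vs. the ratio reported in this issue.
expected_ratio = bounds["IQ3_M"] / bounds["IQ4_XS"]
observed_ratio = 10 / 15  # IQ3_M t/s divided by IQ4_XS t/s, as reported

print(f"expected IQ3_M/IQ4_XS speed ratio: {expected_ratio:.2f}")
print(f"observed IQ3_M/IQ4_XS speed ratio: {observed_ratio:.2f}")
```

Under these assumptions the bandwidth model predicts IQ3_M should be somewhat faster than IQ4_XS, while the reported numbers show the opposite. That would suggest the bottleneck here is something other than memory bandwidth, possibly the cost of decoding IQ3 blocks on this backend, though this sketch cannot confirm that.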
Name and Version
version: 3827 (7691654c) built with Ubuntu clang version 14.0.6-1~kisak1~j for x86_64-pc-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
No response
Oops, I'll retest master branch.
Retested with latest version, the same result.
Potentially related to issue 8760, which also mentions the difference between (IQ1, IQ2, IQ3) and (IQ4 / K).
On NVIDIA (3090), IQ3_M is faster than IQ4_XS (~40 t/s against ~35 t/s).
However, on 1x NVIDIA 3090 (DDR4 offload), IQ3_S and IQ3_M are slower than IQ4_XS (about 0.5x the speed). It seems that only NVIDIA GPUs can handle IQ3 at high speed.
This issue was closed because it has been inactive for 14 days since being marked as stale.