
Misc. bug: Vulkan backend shows negative scaling at low batch sizes with MOE models

Open Mushoz opened this issue 3 months ago • 13 comments

Name and Version

```
[docker@7158e8afaf9c ~]$ llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
version: 6527 (7f766929)
built with cc (GCC) 15.2.1 20250813 for x86_64-pc-linux-gnu
```

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

No response

Command line

llama-batched-bench -m .cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 --no-mmap -c 0 -ntg 128 -npp 512 -npl 1,2,3,4,5,6,7,8

Problem description & steps to reproduce

When benchmarking dense models through llama-batched-bench, the Vulkan backend shows nice scaling across all batch sizes. E.g., Qwen3-8B Q8_0:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.758 |   675.37 |    4.953 |    25.84 |    5.711 |   112.06 |
|   512 |    128 |    2 |   1280 |    1.382 |   740.84 |    5.058 |    50.61 |    6.440 |   198.75 |
|   512 |    128 |    3 |   1920 |    2.282 |   673.16 |    5.257 |    73.04 |    7.539 |   254.67 |
|   512 |    128 |    4 |   2560 |    2.913 |   702.98 |    5.441 |    94.09 |    8.355 |   306.41 |
|   512 |    128 |    5 |   3200 |    3.684 |   694.80 |    5.593 |   114.43 |    9.277 |   344.93 |
|   512 |    128 |    6 |   3840 |    4.408 |   696.92 |    5.841 |   131.47 |   10.249 |   374.66 |
|   512 |    128 |    7 |   4480 |    5.227 |   685.71 |    6.002 |   149.29 |   11.228 |   398.99 |
|   512 |    128 |    8 |   5120 |    5.935 |   690.16 |    6.202 |   165.11 |   12.137 |   421.85 |

But when trying the same with a MOE model (gpt-oss-120b in this case), there is negative scaling at batch sizes 2 and 3. I know MOE models scale worse, since not every sequence activates the same experts (therefore there is less weight sharing between sequences), but I would expect some positive improvement as batch size increases, not the current negative scaling (a rough expert-overlap estimate follows the table below):

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    1.281 |   399.82 |    2.531 |    50.58 |    3.811 |   167.92 |
|   512 |    128 |    2 |   1280 |    2.527 |   405.27 |    7.296 |    35.09 |    9.823 |   130.31 |
|   512 |    128 |    3 |   1920 |    3.879 |   395.98 |    8.605 |    44.62 |   12.484 |   153.79 |
|   512 |    128 |    4 |   2560 |    4.960 |   412.93 |    9.623 |    53.21 |   14.582 |   175.55 |
|   512 |    128 |    5 |   3200 |    6.187 |   413.78 |   10.704 |    59.79 |   16.891 |   189.45 |
|   512 |    128 |    6 |   3840 |    7.419 |   414.05 |   11.554 |    66.47 |   18.974 |   202.39 |
|   512 |    128 |    7 |   4480 |    8.851 |   404.92 |   12.547 |    71.41 |   21.398 |   209.36 |
|   512 |    128 |    8 |   5120 |    9.971 |   410.79 |   13.604 |    75.27 |   23.575 |   217.18 |
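For a rough sense of how little expert-weight reuse to expect at these batch sizes, here is a back-of-the-envelope sketch (my own estimate, not a measurement; it assumes independent, uniform top-4 routing over 128 experts, which I believe matches gpt-oss-120b's config, and which a real router does not follow exactly):

```cpp
// Back-of-the-envelope estimate of expert-weight reuse at small batch sizes.
// ASSUMPTIONS (not measured): independent, uniform top-k routing with
// n_expert = 128 and k = 4, believed to match gpt-oss-120b.
#include <cmath>
#include <cstdio>

int main() {
    const double n_expert = 128.0; // total routed experts per layer (assumed)
    const double k        = 4.0;   // experts activated per token   (assumed)

    for (int batch = 1; batch <= 8; ++batch) {
        // Expected number of distinct experts touched by `batch` tokens:
        //   E[distinct] = n_expert * (1 - (1 - k/n_expert)^batch)
        const double distinct = n_expert * (1.0 - std::pow(1.0 - k / n_expert, batch));
        // Reuse factor: how many (token, expert) products share one weight read.
        const double reuse = (batch * k) / distinct;
        std::printf("batch %d: ~%5.1f distinct experts, reuse ~%.2f\n",
                    batch, distinct, reuse);
    }
    return 0;
}
```

At batch 2 the expected reuse factor is barely above 1, so roughly flat per-sequence throughput would be understandable; the table above instead shows total throughput dropping below the batch-1 rate, which is what makes this look like a backend issue rather than an inherent MOE property.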

First Bad Commit

No response

Relevant log output


Mushoz avatar Sep 20 '25 19:09 Mushoz

Actually, it might not be related to MOE models, but to gpt-oss-120b (either the model architecture or its special quant) specifically. If I run Qwen3-30B-A3B Q8_0, I get the following result:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.701 |   730.14 |    2.331 |    54.90 |    3.033 |   211.04 |
|   512 |    128 |    2 |   1280 |    1.348 |   759.81 |    4.546 |    56.31 |    5.894 |   217.18 |
|   512 |    128 |    3 |   1920 |    2.082 |   737.83 |    5.490 |    69.94 |    7.572 |   253.56 |
|   512 |    128 |    4 |   2560 |    2.640 |   775.83 |    6.192 |    82.68 |    8.832 |   289.85 |
|   512 |    128 |    5 |   3200 |    3.317 |   771.77 |    6.899 |    92.77 |   10.216 |   313.24 |
|   512 |    128 |    6 |   3840 |    3.980 |   771.87 |    7.470 |   102.81 |   11.450 |   335.37 |
|   512 |    128 |    7 |   4480 |    4.745 |   755.38 |    8.107 |   110.52 |   12.852 |   348.60 |
|   512 |    128 |    8 |   5120 |    5.372 |   762.53 |    8.457 |   121.09 |   13.828 |   370.26 |

As can be seen, it scales quite nicely despite being a MOE model.

edit: I guess it's then caused by either the MXFP4 quant or SWA attention? I will try out Gemma 3 (which also uses SWA) and gpt-oss-20b next.

edit 2: I can see some other models have also been quantized using MXFP4. I will try those to see if I get similar negative scaling. That could confirm the quant is to blame.

Mushoz avatar Sep 20 '25 19:09 Mushoz

@Mushoz could you check if #15363 gives you better speed?

lovedheart avatar Sep 20 '25 20:09 lovedheart

@Mushoz could you check if #15363 gives you better speed?

I see the same negative scaling at batch sizes 2 and 3, and overall performance is ever so slightly lower as well:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    1.389 |   368.58 |    2.587 |    49.47 |    3.976 |   160.95 |
|   512 |    128 |    2 |   1280 |    2.557 |   400.43 |    7.386 |    34.66 |    9.944 |   128.73 |
|   512 |    128 |    3 |   1920 |    3.950 |   388.89 |    8.807 |    43.60 |   12.757 |   150.51 |
|   512 |    128 |    4 |   2560 |    5.125 |   399.58 |    9.797 |    52.26 |   14.923 |   171.55 |
|   512 |    128 |    5 |   3200 |    6.267 |   408.49 |   10.791 |    59.31 |   17.058 |   187.60 |
|   512 |    128 |    6 |   3840 |    7.533 |   407.79 |   11.762 |    65.29 |   19.295 |   199.01 |
|   512 |    128 |    7 |   4480 |    8.936 |   401.09 |   12.661 |    70.77 |   21.597 |   207.44 |
|   512 |    128 |    8 |   5120 |   10.090 |   405.95 |   13.631 |    75.12 |   23.721 |   215.84 |

Mushoz avatar Sep 20 '25 20:09 Mushoz

Gemma 3 did NOT show the issue, so SWA can be ruled out.

gpt-oss-20b does show the same issue:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.541 |   945.81 |    1.803 |    70.99 |    2.344 |   272.98 |
|   512 |    128 |    2 |   1280 |    0.963 |  1063.65 |    4.536 |    56.44 |    5.499 |   232.78 |
|   512 |    128 |    3 |   1920 |    1.518 |  1012.09 |    5.397 |    71.16 |    6.914 |   277.69 |
|   512 |    128 |    4 |   2560 |    1.921 |  1066.19 |    5.834 |    87.76 |    7.755 |   330.10 |
|   512 |    128 |    5 |   3200 |    2.366 |  1081.87 |    6.406 |    99.91 |    8.772 |   364.80 |
|   512 |    128 |    6 |   3840 |    2.867 |  1071.66 |    6.765 |   113.52 |    9.632 |   398.68 |
|   512 |    128 |    7 |   4480 |    3.397 |  1055.14 |    7.260 |   123.41 |   10.657 |   420.39 |
|   512 |    128 |    8 |   5120 |    3.822 |  1071.70 |    7.727 |   132.51 |   11.549 |   443.31 |

GLM 4.5 Air quantized to MXFP4 does show a similar issue (although it only dips at batch size 2 and is roughly equal at batch size 3), so it seems to be triggered by MXFP4 quantization:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    2.538 |   201.75 |    6.632 |    19.30 |    9.170 |    69.79 |
|   512 |    128 |    2 |   1280 |    5.020 |   204.00 |   14.947 |    17.13 |   19.967 |    64.11 |
|   512 |    128 |    3 |   1920 |    7.844 |   195.82 |   18.559 |    20.69 |   26.403 |    72.72 |
|   512 |    128 |    4 |   2560 |    9.982 |   205.18 |   21.514 |    23.80 |   31.496 |    81.28 |
|   512 |    128 |    5 |   3200 |   12.565 |   203.74 |   24.395 |    26.24 |   36.959 |    86.58 |
|   512 |    128 |    6 |   3840 |   15.024 |   204.48 |   26.716 |    28.75 |   41.740 |    92.00 |
|   512 |    128 |    7 |   4480 |   17.840 |   200.90 |   29.204 |    30.68 |   47.043 |    95.23 |
|   512 |    128 |    8 |   5120 |   20.190 |   202.87 |   31.632 |    32.37 |   51.822 |    98.80 |

I will retest GLM 4.5 Air tomorrow with a non-MXFP4 quantization to confirm it's really the quantization causing it. It's getting late over here :)

Mushoz avatar Sep 20 '25 21:09 Mushoz

Actually, it seems my initial post is still mostly correct. It obviously doesn't impact all MOE models equally (it strongly depends on the quant), but running the test with Qwen3-30B-A3B Q4_K_S shows the exact same problem (so it's not MXFP4-specific). As a matter of fact, it's much worse with Q4_K_S:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.770 |   664.76 |    1.577 |    81.17 |    2.347 |   272.68 |
|   512 |    128 |    2 |   1280 |    1.485 |   689.68 |    5.884 |    43.51 |    7.369 |   173.71 |
|   512 |    128 |    3 |   1920 |    2.413 |   636.63 |    7.002 |    54.85 |    9.414 |   203.95 |
|   512 |    128 |    4 |   2560 |    3.114 |   657.67 |    7.982 |    64.14 |   11.096 |   230.71 |
|   512 |    128 |    5 |   3200 |    3.822 |   669.73 |    9.014 |    71.00 |   12.837 |   249.28 |
|   512 |    128 |    6 |   3840 |    4.662 |   658.94 |    9.906 |    77.53 |   14.568 |   263.59 |
|   512 |    128 |    7 |   4480 |    5.471 |   655.12 |   10.785 |    83.08 |   16.256 |   275.59 |
|   512 |    128 |    8 |   5120 |    6.155 |   665.53 |   11.393 |    89.88 |   17.548 |   291.78 |

Qwen3-30B-A3B MXFP4 shows the same issue, but it's much less severe:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.667 |   768.17 |    1.914 |    66.88 |    2.581 |   248.01 |
|   512 |    128 |    2 |   1280 |    1.327 |   771.82 |    4.488 |    57.04 |    5.815 |   220.13 |
|   512 |    128 |    3 |   1920 |    2.092 |   734.21 |    5.380 |    71.38 |    7.472 |   256.97 |
|   512 |    128 |    4 |   2560 |    2.688 |   761.90 |    5.991 |    85.47 |    8.679 |   294.98 |
|   512 |    128 |    5 |   3200 |    3.341 |   766.18 |    6.655 |    96.17 |    9.996 |   320.12 |
|   512 |    128 |    6 |   3840 |    4.020 |   764.20 |    7.201 |   106.65 |   11.221 |   342.21 |
|   512 |    128 |    7 |   4480 |    4.765 |   752.10 |    7.854 |   114.08 |   12.619 |   355.01 |
|   512 |    128 |    8 |   5120 |    5.432 |   753.99 |    8.299 |   123.38 |   13.732 |   372.85 |

Qwen3-30B-A3B Q4_0 again shows the same issue:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.669 |   765.47 |    1.637 |    78.18 |    2.306 |   277.51 |
|   512 |    128 |    2 |   1280 |    1.277 |   802.17 |    4.107 |    62.33 |    5.384 |   237.75 |
|   512 |    128 |    3 |   1920 |    2.058 |   746.43 |    4.775 |    80.42 |    6.833 |   281.01 |
|   512 |    128 |    4 |   2560 |    2.544 |   804.95 |    5.440 |    94.12 |    7.984 |   320.62 |
|   512 |    128 |    5 |   3200 |    3.188 |   803.10 |    5.985 |   106.93 |    9.173 |   348.87 |
|   512 |    128 |    6 |   3840 |    3.834 |   801.35 |    6.507 |   118.04 |   10.340 |   371.37 |
|   512 |    128 |    7 |   4480 |    4.541 |   789.23 |    7.177 |   124.84 |   11.718 |   382.31 |
|   512 |    128 |    8 |   5120 |    5.123 |   799.50 |    7.587 |   134.98 |   12.710 |   402.84 |

Mushoz avatar Sep 21 '25 14:09 Mushoz

From my understanding, the backends use vec_mat_muls for the single-batch case and mat_mat_muls for the batched case. Could it be that the mat_mat_mul pathway is being used, but that at low batch sizes (such as 2, which shows the issue the strongest) most experts are still only used by one sequence at a time? In that case each expert's mat_mat_mul runs with the first matrix having a second dimension of length 1 (because nothing can be batched when only a single sequence uses a specific expert), which is inefficient compared to the vec_mat_mul pathway.
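
To illustrate the kind of fallback I have in mind, here is a purely hypothetical host-side sketch (not the actual ggml-vulkan dispatch code; every name in it is made up):

```cpp
// Hypothetical dispatch heuristic: if no expert has more than one token
// routed to it in this ubatch, the mat-mat path degenerates into a set of
// independent mat-vec products, so a vec_mat_mul-style shader could be used
// instead. All struct/function/field names are invented for illustration.
#include <algorithm>
#include <vector>

struct expert_batch_info {
    std::vector<int> n_tokens_for_expert; // tokens routed to each expert
};

enum class mmid_path { mat_vec, mat_mat };

static mmid_path pick_mul_mat_id_path(const expert_batch_info & info) {
    const int max_tokens_per_expert = info.n_tokens_for_expert.empty() ? 0 :
        *std::max_element(info.n_tokens_for_expert.begin(),
                          info.n_tokens_for_expert.end());
    // Nothing to batch per expert -> the small-N mat-mat path gains nothing.
    return max_tokens_per_expert <= 1 ? mmid_path::mat_vec : mmid_path::mat_mat;
}
```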

Mushoz avatar Sep 21 '25 14:09 Mushoz

That was going to be my guess too - we have an optimized path for N=[2,8] for MAT_MUL but not for MAT_MUL_ID.

jeffbolznv avatar Sep 21 '25 14:09 jeffbolznv

That was going to be my guess too - we have an optimized path for N=[2,8] for MAT_MUL but not for MAT_MUL_ID.

How difficult would it be to check the dimensions of the input matrices actually being used in each expert and fall back to the optimized vec_mat_mul pathway if the last dimension is 1?

Mushoz avatar Sep 21 '25 14:09 Mushoz

I don't think it'll be that straightforward, but I have to refresh my memory on how the dimensions work for mat_mul_id every time I touch it, so I'm not totally sure. MAT_MUL is very bandwidth limited loading the A matrix, and we take advantage of that to do multiple cols of B each time we load a row of A. For MAT_MUL_ID, there are effectively multiple A matrices, and one is selected for each row of the output based on the expert id. Maybe sometimes they would all be the same expert, but sometimes they won't, and if they aren't then I don't think we should expect a similar speedup.
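
Roughly, the computation described above looks like this scalar reference loop (just a sketch of the semantics; it ignores ggml's real tensor layout, broadcasting, and quantized types, and the names are mine):

```cpp
// Scalar reference for a MUL_MAT_ID-style computation: there is one A matrix
// per expert, and each output (one per routed token/expert slot) selects its
// A matrix via the routing ids. ggml layout/quantization details omitted.
#include <cstddef>
#include <vector>

std::vector<std::vector<float>> mul_mat_id_ref(
        const std::vector<std::vector<float>> & as,  // [n_expert], each n_rows*n_cols, row-major
        const std::vector<std::vector<float>> & b,   // [n_used], each n_cols activations
        const std::vector<int> & ids,                // [n_used], expert chosen per output
        int n_rows, int n_cols) {
    std::vector<std::vector<float>> out(ids.size(), std::vector<float>(n_rows, 0.0f));
    for (std::size_t u = 0; u < ids.size(); ++u) {
        const std::vector<float> & a = as[ids[u]];   // expert weights picked per output
        for (int r = 0; r < n_rows; ++r) {
            float acc = 0.0f;
            for (int c = 0; c < n_cols; ++c) {
                acc += a[r * n_cols + c] * b[u][c];
            }
            out[u][r] = acc;
        }
    }
    return out;
}
```

The MAT_MUL trick of loading a row of A once and applying it to several columns of B only helps here when several entries of `ids` happen to name the same expert, which at batch sizes 2-3 is uncommon.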

jeffbolznv avatar Sep 21 '25 14:09 jeffbolznv

One thing to note is that disabling coopmat seems to make the issue worse:

Qwen3-30B-A3B Q4_K_S:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.791 |   646.97 |    1.624 |    78.82 |    2.415 |   264.98 |
|   512 |    128 |    2 |   1280 |    1.626 |   629.59 |    6.137 |    41.72 |    7.763 |   164.88 |
|   512 |    128 |    3 |   1920 |    2.536 |   605.68 |    7.328 |    52.40 |    9.864 |   194.65 |
|   512 |    128 |    4 |   2560 |    3.232 |   633.72 |    8.392 |    61.01 |   11.624 |   220.24 |
|   512 |    128 |    5 |   3200 |    4.042 |   633.39 |    9.471 |    67.58 |   13.512 |   236.82 |
|   512 |    128 |    6 |   3840 |    4.865 |   631.42 |   10.378 |    74.00 |   15.243 |   251.92 |
|   512 |    128 |    7 |   4480 |    5.770 |   621.11 |   11.393 |    78.64 |   17.163 |   261.02 |
|   512 |    128 |    8 |   5120 |    6.509 |   629.25 |   12.075 |    84.80 |   18.584 |   275.50 |

Qwen3-30B-A3B Q4_K_S with GGML_VK_DISABLE_COOPMAT=1:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    1.146 |   446.75 |    1.676 |    76.35 |    2.823 |   226.74 |
|   512 |    128 |    2 |   1280 |    2.191 |   467.47 |    7.799 |    32.83 |    9.989 |   128.14 |
|   512 |    128 |    3 |   1920 |    3.374 |   455.26 |    9.379 |    40.94 |   12.753 |   150.55 |
|   512 |    128 |    4 |   2560 |    4.303 |   475.96 |   10.617 |    48.22 |   14.920 |   171.58 |
|   512 |    128 |    5 |   3200 |    5.398 |   474.27 |   11.987 |    53.39 |   17.385 |   184.07 |
|   512 |    128 |    6 |   3840 |    6.501 |   472.52 |   13.082 |    58.71 |   19.583 |   196.09 |
|   512 |    128 |    7 |   4480 |    7.697 |   465.63 |   14.447 |    62.02 |   22.144 |   202.31 |
|   512 |    128 |    8 |   5120 |    8.805 |   465.17 |   15.218 |    67.29 |   24.024 |   213.12 |

Not sure if it's useful/relevant information, but I wanted to share it just in case.

Mushoz avatar Sep 21 '25 14:09 Mushoz

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Nov 05 '25 01:11 github-actions[bot]

I am still interested in this issue. Let me know if I can help in any way with your research @0cc4m

Mushoz avatar Nov 08 '25 19:11 Mushoz

Yeah, it would be good to improve it; we just have to figure out a way to do that. Maybe small batches of mul_mat_id could be split up into multiple mul_mat_vec shader calls, along the lines of the sketch below. Do you know how CUDA and Metal handle this case?
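
Something like this, conceptually (a hypothetical host-side sketch only; `mmid_call`, `dispatch_mat_vec`, and the threshold are placeholders, not the real backend API):

```cpp
// Conceptual sketch: for small ubatches, split one mul_mat_id into one
// mat-vec dispatch per (token, expert slot) pair instead of using the
// mat-mat path. Names and the threshold are invented for illustration;
// a real implementation would record these into one command buffer and
// handle quantized weight types.
#include <cstdint>
#include <functional>

struct mmid_call {
    int             n_tokens; // tokens in this ubatch
    int             n_used;   // experts used per token (top-k)
    const int32_t * ids;      // [n_tokens * n_used] routed expert ids
};

// dispatch_mat_vec(expert_id, token, slot): enqueue one mat-vec shader call
// multiplying expert `expert_id`'s weights with token `token`'s activations,
// writing the output for expert slot `slot`.
static bool try_split_mul_mat_id(const mmid_call & call,
        const std::function<void(int, int, int)> & dispatch_mat_vec) {
    const int small_batch_limit = 4; // would need tuning
    if (call.n_tokens > small_batch_limit) {
        return false; // keep the existing mat-mat path for larger batches
    }
    for (int t = 0; t < call.n_tokens; ++t) {
        for (int s = 0; s < call.n_used; ++s) {
            dispatch_mat_vec(call.ids[t * call.n_used + s], t, s);
        }
    }
    return true;
}
```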

0cc4m avatar Nov 09 '25 08:11 0cc4m