sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs
This PR extends the work introduced in https://github.com/ggml-org/llama.cpp/pull/12035. MMVQ Q4_0 now supports the block_q_t reorder layout.
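For context, here is a rough sketch of the two layouts. The `block_q4_0` definition follows ggml (with the half type simplified); the reordered arrangement paraphrases the layout from the PR above, so treat it as an illustration rather than the exact code:

```cpp
#include <cstdint>

// Standard Q4_0 (array-of-structs): each block interleaves its scale with
// its packed 4-bit quants. This mirrors ggml's block_q4_0 definition.
#define QK4_0 32
typedef uint16_t ggml_half; // fp16 scale, stored as raw 16 bits here for brevity

typedef struct {
    ggml_half d;             // per-block scale (delta)
    uint8_t   qs[QK4_0 / 2]; // 32 x 4-bit quants, packed two per byte
} block_q4_0;

// Reordered Q4_0 (struct-of-arrays): all qs of the tensor are stored
// contiguously, followed by all scales, so the MMVQ kernel can read each
// region with unit stride:
//
//   [ qs_0 | qs_1 | ... | qs_{n-1} ][ d_0 | d_1 | ... | d_{n-1} ]
```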
The improvements show up in text generation (TG128); the apparent PP512 improvement on the DataMax 1100 is within noise.
The PR includes:
- A refactor of the vecdot traits, defined in the `reorder_vec_dot_q_sycl` struct.
- A new entrypoint for reordered MMVQ vecdots: `reorder_mul_mat_vec_q4_0_q8_1_sycl`.
- A new helper function, `safe_div`, for ceil/round-up division, named consistently with other backends (see the sketch after this list).
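As an illustration, here is a minimal sketch of `safe_div` and of the traits idea; the exact signatures and the `mmvq_kernel` name are assumptions, not the PR's code:

```cpp
#include <cstddef>

// Hypothetical illustration of the traits idea (name assumed): each
// quantized type specializes the dot product between a reordered block
// and a Q8_1 block, so the MMVQ kernel is written once and instantiated
// per type.
template <typename reorder_traits> struct mmvq_kernel; // sketch only

// safe_div: ceiling (round-up) integer division, named for consistency
// with the helpers used by other ggml backends.
static constexpr size_t safe_div(const size_t a, const size_t b) {
    return (a + b - 1) / b;
}

// Example: covering 513 rows with work-groups of size 32 needs
// safe_div(513, 32) == 17 work-groups.
static_assert(safe_div(513, 32) == 17, "round-up division");
```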
Still pending TODOs:
- [x] Improve and find a proper location for the comment describing the reordered layout
- [x] Default to DMMV if the reordered Q4_0 is not supported.
- [ ] Get the performance for an Arc7X0
Benchmarking
Compiler: ICPX 2025.1
Builds:
- 02082f15 (b4963)
- 44e199dd (This PR)
GPU & Drivers:
- Intel(R) Arc(TM) B580 Graphics 20.1.0 [1.6.32567+16]
- Intel(R) Data Center GPU Max 1100 12.60.7 [1.6.32567+18]
- Lunar Lake, Intel(R) Arc(TM) Graphics 20.4.4 [1.6.32567+16] (iGPU)
`DISABLE_OPT` is the value of the `GGML_SYCL_DISABLE_OPT` environment variable.
| GPU | model | backend | ngl | DISABLE_OPT | (02082f15) pp512 | (44e199dd) pp512 | (02082f15) tg128 | (44e199dd) tg128 |
|---|---|---|---|---|---|---|---|---|
| B580 | qwen2 1.5B Q4_0 | SYCL | 99 | 0 | 6286.16 ± 14.00 | 6233.05 ± 22.66 | 105.35 ± 1.70 | 134.61 ± 5.38 |
| B580 | llama 7B Q4_0 | SYCL | 99 | 0 | 1649.27 ± 1.84 | 1648.96 ± 2.41 | 40.97 ± 0.19 | 65.21 ± 0.21 |
| B580 | phi3 3B Q4_0 | SYCL | 99 | 0 | 2461.62 ± 3.06 | 2462.38 ± 3.46 | 62.36 ± 0.43 | 94.31 ± 0.20 |
| B580 | qwen2 1.5B Q4_0 | SYCL | 99 | 1 | 7863.81 ± 30.10 | 7813.15 ± 55.52 | 100.45 ± 2.72 | 96.97 ± 0.32 |
| B580 | llama 7B Q4_0 | SYCL | 99 | 1 | 2211.87 ± 1.64 | 2212.20 ± 1.83 | 40.03 ± 0.22 | 39.85 ± 0.08 |
| B580 | phi3 3B Q4_0 | SYCL | 99 | 1 | 3133.46 ± 5.73 | 3132.75 ± 4.61 | 61.17 ± 0.34 | 61.75 ± 0.45 |
| GPU | model | backend | ngl | DISABLE_OPT | (02082f15) pp512 | (44e199dd) pp512 | (02082f15) tg128 | (44e199dd) tg128 |
|---|---|---|---|---|---|---|---|---|
| DataMax 1100 | qwen2 1.5B Q4_0 | SYCL | 99 | 0 | 6759.80 ± 38.41 | 7272.88 ± 40.08 | 121.96 ± 1.07 | 143.40 ± 0.90 |
| DataMax 1100 | llama 7B Q4_0 | SYCL | 99 | 0 | 1778.88 ± 6.92 | 1793.16 ± 7.07 | 56.72 ± 0.25 | 71.40 ± 0.41 |
| DataMax 1100 | phi3 3B Q4_0 | SYCL | 99 | 0 | 2863.51 ± 13.92 | 2867.34 ± 4.07 | 92.18 ± 0.20 | 110.15 ± 0.57 |
| DataMax 1100 | qwen2 1.5B Q4_0 | SYCL | 99 | 1 | 9169.12 ± 142.33 | 9350.59 ± 60.30 | 94.20 ± 0.43 | 94.29 ± 0.41 |
| DataMax 1100 | llama 7B Q4_0 | SYCL | 99 | 1 | 2543.61 ± 8.16 | 2553.34 ± 22.99 | 36.27 ± 0.09 | 36.61 ± 0.07 |
| DataMax 1100 | phi3 3B Q4_0 | SYCL | 99 | 1 | 3952.37 ± 24.30 | 3938.14 ± 23.66 | 66.91 ± 0.07 | 67.24 ± 0.15 |
| GPU | model | backend | ngl | DISABLE_OPT | (02082f15) pp512 | (44e199dd) pp512 | (02082f15) tg128 | (44e199dd) tg128 |
|---|---|---|---|---|---|---|---|---|
| Arc 140V | qwen2 1.5B Q4_0 | SYCL | 99 | 0 | 1100.09 ± 1.13 | 1127.81 ± 35.95 | 38.14 ± 0.38 | 46.03 ± 0.21 |
| Arc 140V | llama 7B Q4_0 | SYCL | 99 | 0 | 316.47 ± 0.41 | 321.85 ± 5.92 | 13.09 ± 0.73 | 20.52 ± 0.04 |
| Arc 140V | phi3 3B Q4_0 | SYCL | 99 | 0 | 512.94 ± 0.32 | 515.34 ± 1.68 | 20.49 ± 0.33 | 30.56 ± 0.10 |
| Arc 140V | qwen2 1.5B Q4_0 | SYCL | 99 | 1 | 1492.78 ± 60.74 | 1514.96 ± 55.74 | 34.06 ± 0.23 | 33.80 ± 1.07 |
| Arc 140V | llama 7B Q4_0 | SYCL | 99 | 1 | 519.44 ± 0.78 | 393.29 ± 17.66 | 11.69 ± 0.45 | 12.04 ± 0.94 |
| Arc 140V | phi3 3B Q4_0 | SYCL | 99 | 1 | 752.71 ± 21.10 | 787.60 ± 6.11 | 18.77 ± 0.04 | 18.85 ± 0.10 |
@Alcpz I fixed the known issues of the Q4_0 reorder optimization in https://github.com/ggml-org/llama.cpp/pull/13003. You could refer to it for this PR; the wrong-result issue could be fixed easily.
@Alcpz I find this PR still doesn't resolve the wrong-result issue with llama-7b Q4_0, and the performance is reduced too.
We hope this feature increases performance further, but in this test case it is reduced.
Tested with:
```sh
./examples/sycl/build.sh
./examples/sycl/run-llama2.sh
```
with the following settings:
```sh
GGML_SYCL_DISABLE_OPT=0
GGML_SYCL_PRIORITIZE_DMMV=0
```
Result:
```
Step 1: Get to know the basics of web design
Step 2: Set up a web hosting account
Step 3: Download a free website builder
Step 4: Set up a domain name
Step 5: Design your website
Step 6: Add content to your site
Step 7: Make the site responsive
Step 8: Add a contact form
Step 9: Add a social media share button
Step 10: Advertise your website
```
```
llama_perf_context_print: prompt eval time =   322.34 ms / 19 tokens (16.97 ms per token, 58.94 tokens per second)
llama_perf_context_print:        eval time = 10927.57 ms / 399 runs  (27.39 ms per token, 36.51 tokens per second)
```
But after enabling the new feature in this PR:
```sh
GGML_SYCL_DISABLE_OPT=0
GGML_SYCL_PRIORITIZE_DMMV=1
```
```
Step 1: Select the Website Name
Step 2: Select the Website Type
Step 3: Choose a Website Theme
Step 4: Choose a Website Name
Step 5: Choose a Website Theme
Step 6: Choose a Website Name
Step 7: Choose a Website Theme
Step 8: Choose a Website Name
Step 9: Choose a Website Theme
Step 10: Choose a Website Theme
```
```
llama_perf_context_print: prompt eval time =   294.09 ms / 19 tokens (15.48 ms per token, 64.61 tokens per second)
llama_perf_context_print:        eval time = 14253.18 ms / 399 runs  (35.72 ms per token, 27.99 tokens per second)
```
@Alcpz I suggest reverting this PR first; after the issue is fixed, it can be merged again.
@NeoZhangJianyu I recommend testing with llama 3 now. llama 2 is outdated and no one uses it.
I wonder why the llama2 result is wrong. The reorder feature only changes where the data is stored; the result of any LLM shouldn't be impacted.
llama2 is the design base of many other models, so I'm afraid other LLMs based on llama2 will be impacted too.
In this PR, I guess the root cause is that the code path changed, which impacts the result, but it looks like that is being overlooked.
> @NeoZhangJianyu I recommend testing with llama 3 now. llama 2 is outdated and no one uses it.
I tested with llama3-8b and the wrong result is present there too.
With qwen2-1.5b-instruct-q4_0.gguf there is no impact on the result.
I'll put a patch up that disables MMVQ for Q4_0 for now until I find out why this happens.
Edit: My point here is that I'm not reverting all this work if you find issues with specific models. If there are problems we should detect and fix them; our mul mat dispatching should be flexible enough to allow us to change these things with little effort.
@NeoZhangJianyu we've discussed this issue a bit more offline. So far we have not been able to reproduce the issue you mention, so we don't want to get blocked by it. To make the discussion clearer, could you open an issue in llama.cpp with the exact setup that you use (specific host and device HW and relevant versions of oneAPI, GPU drivers, etc.)? We are planning to share docker images of what we're using exactly. I think we need to better understand what the differences in our setups are. In the meantime we are planning to keep these changes as they are and progress with https://github.com/ggml-org/llama.cpp/pull/13109. It would also be useful to see if more users run into similar issues or if it is specific to your setup.
(edited: added a missing "not")
> I'll put a patch up that disables MMVQ for Q4_0 for now until I find out why this happens.
>
> Edit: My point here is that I'm not reverting all this work if you find issues with specific models. If there are problems we should detect and fix them; our mul mat dispatching should be flexible enough to allow us to change these things with little effort.
LGTM!
> @NeoZhangJianyu we've discussed this issue a bit more offline. So far we have not been able to reproduce the issue you mention, so we don't want to get blocked by it. To make the discussion clearer, could you open an issue in llama.cpp with the exact setup that you use (specific host and device HW and relevant versions of oneAPI, GPU drivers, etc.)? We are planning to share docker images of what we're using exactly. I think we need to better understand what the differences in our setups are. In the meantime we are planning to keep these changes as they are and progress with #13109. It would also be useful to see if more users run into similar issues or if it is specific to your setup.
My test case is very simple and easy to reproduce: Ubuntu 22.04, Arc 770, oneAPI 2025.0.
I don't think we need a separate issue to track this when we found it during the PR review. I added the environment variable GGML_SYCL_DISABLE_OPT to enable/disable the reorder feature so its effect can be tested.
While developing the reorder feature, I noticed the wrong-result issue even though the UT case passed: the UT allows an error of < 0.5, but real LLM inference needs a smaller error. I spent a lot of time handling this issue.
It looks like the inference result check is easily overlooked.