sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs
This PR extends the work introduced in https://github.com/ggml-org/llama.cpp/pull/12035. MMVQ Q4_0 now supports the block_q_t reorder layout.
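For context, here is a rough sketch of the two layouts. The `block_q4_0` definition follows ggml (with the half type simplified); the reordered arrangement paraphrases the layout from the PR above, so treat it as an illustration rather than the exact code:

```cpp
#include <cstdint>

// Standard Q4_0 (array-of-structs): each block interleaves its scale with
// its packed 4-bit quants. This mirrors ggml's block_q4_0 definition.
#define QK4_0 32
typedef uint16_t ggml_half; // fp16 scale, stored as raw 16 bits here for brevity

typedef struct {
    ggml_half d;             // per-block scale (delta)
    uint8_t   qs[QK4_0 / 2]; // 32 x 4-bit quants, packed two per byte
} block_q4_0;

// Reordered Q4_0 (struct-of-arrays): all qs of the tensor are stored
// contiguously, followed by all scales, so the MMVQ kernel can read each
// region with unit stride:
//
//   [ qs_0 | qs_1 | ... | qs_{n-1} ][ d_0 | d_1 | ... | d_{n-1} ]
```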
The improvements show up in text generation (TG128); the apparent PP512 improvement on the DataMax 1100 is within noise.
The PR includes:
- A refactor of the vecdot traits, defined in the `reorder_vec_dot_q_sycl` struct.
- A new entrypoint for reordered MMVQ vecdots: `reorder_mul_mat_vec_q4_0_q8_1_sycl`.
- A new helper function, `safe_div`, for ceil/round-up division, named consistently with other backends (see the sketch after this list).
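As an illustration, here is a minimal sketch of `safe_div` and of the traits idea; the exact signatures and the `mmvq_kernel` name are assumptions, not the PR's code:

```cpp
#include <cstddef>

// Hypothetical illustration of the traits idea (name assumed): each
// quantized type specializes the dot product between a reordered block
// and a Q8_1 block, so the MMVQ kernel is written once and instantiated
// per type.
template <typename reorder_traits> struct mmvq_kernel; // sketch only

// safe_div: ceiling (round-up) integer division, named for consistency
// with the helpers used by other ggml backends.
static constexpr size_t safe_div(const size_t a, const size_t b) {
    return (a + b - 1) / b;
}

// Example: covering 513 rows with work-groups of size 32 needs
// safe_div(513, 32) == 17 work-groups.
static_assert(safe_div(513, 32) == 17, "round-up division");
```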
Still pending TODOs:
- [x] Improve and find a proper location for the comment describing the reordered layout
- [x] Default to DMMV if the reordered Q4_0 is not supported.
- [ ] Get the performance for an Arc7X0
Benchmarking
Compiler: ICPX 2025.1
Builds:
- 02082f15 (b4963)
- 44e199dd (This PR)
GPU & Drivers:
- Intel(R) Arc(TM) B580 Graphics 20.1.0 [1.6.32567+16]
- Intel(R) Data Center GPU Max 1100 12.60.7 [1.6.32567+18]
- Lunar Lake, Intel(R) Arc(TM) Graphics 20.4.4 [1.6.32567+16] (iGPU)
`DISABLE_OPT` is the value of the `GGML_SYCL_DISABLE_OPT` environment variable.
| GPU | model | backend | ngl | DISABLE_OPT | (02082f15) pp512 | (44e199dd) pp512 | (02082f15) tg128 | (44e199dd) tg128 |
|---|---|---|---|---|---|---|---|---|
| B580 | qwen2 1.5B Q4_0 | SYCL | 99 | 0 | 6286.16 ± 14.00 | 6233.05 ± 22.66 | 105.35 ± 1.70 | 134.61 ± 5.38 |
| B580 | llama 7B Q4_0 | SYCL | 99 | 0 | 1649.27 ± 1.84 | 1648.96 ± 2.41 | 40.97 ± 0.19 | 65.21 ± 0.21 |
| B580 | phi3 3B Q4_0 | SYCL | 99 | 0 | 2461.62 ± 3.06 | 2462.38 ± 3.46 | 62.36 ± 0.43 | 94.31 ± 0.20 |
| B580 | qwen2 1.5B Q4_0 | SYCL | 99 | 1 | 7863.81 ± 30.10 | 7813.15 ± 55.52 | 100.45 ± 2.72 | 96.97 ± 0.32 |
| B580 | llama 7B Q4_0 | SYCL | 99 | 1 | 2211.87 ± 1.64 | 2212.20 ± 1.83 | 40.03 ± 0.22 | 39.85 ± 0.08 |
| B580 | phi3 3B Q4_0 | SYCL | 99 | 1 | 3133.46 ± 5.73 | 3132.75 ± 4.61 | 61.17 ± 0.34 | 61.75 ± 0.45 |
| GPU | model | backend | ngl | DISABLE_OPT | (02082f15) pp512 | (44e199dd) pp512 | (02082f15) tg128 | (44e199dd) tg128 |
|---|---|---|---|---|---|---|---|---|
| DataMax 1100 | qwen2 1.5B Q4_0 | SYCL | 99 | 0 | 6759.80 ± 38.41 | 7272.88 ± 40.08 | 121.96 ± 1.07 | 143.40 ± 0.90 |
| DataMax 1100 | llama 7B Q4_0 | SYCL | 99 | 0 | 1778.88 ± 6.92 | 1793.16 ± 7.07 | 56.72 ± 0.25 | 71.40 ± 0.41 |
| DataMax 1100 | phi3 3B Q4_0 | SYCL | 99 | 0 | 2863.51 ± 13.92 | 2867.34 ± 4.07 | 92.18 ± 0.20 | 110.15 ± 0.57 |
| DataMax 1100 | qwen2 1.5B Q4_0 | SYCL | 99 | 1 | 9169.12 ± 142.33 | 9350.59 ± 60.30 | 94.20 ± 0.43 | 94.29 ± 0.41 |
| DataMax 1100 | llama 7B Q4_0 | SYCL | 99 | 1 | 2543.61 ± 8.16 | 2553.34 ± 22.99 | 36.27 ± 0.09 | 36.61 ± 0.07 |
| DataMax 1100 | phi3 3B Q4_0 | SYCL | 99 | 1 | 3952.37 ± 24.30 | 3938.14 ± 23.66 | 66.91 ± 0.07 | 67.24 ± 0.15 |
| GPU | model | backend | ngl | DISABLE_OPT | (02082f15) pp512 | (44e199dd) pp512 | (02082f15) tg128 | (44e199dd) tg128 |
|---|---|---|---|---|---|---|---|---|
| Arc 140V | qwen2 1.5B Q4_0 | SYCL | 99 | 0 | 1100.09 ± 1.13 | 1127.81 ± 35.95 | 38.14 ± 0.38 | 46.03 ± 0.21 |
| Arc 140V | llama 7B Q4_0 | SYCL | 99 | 0 | 316.47 ± 0.41 | 321.85 ± 5.92 | 13.09 ± 0.73 | 20.52 ± 0.04 |
| Arc 140V | phi3 3B Q4_0 | SYCL | 99 | 0 | 512.94 ± 0.32 | 515.34 ± 1.68 | 20.49 ± 0.33 | 30.56 ± 0.10 |
| Arc 140V | qwen2 1.5B Q4_0 | SYCL | 99 | 1 | 1492.78 ± 60.74 | 1514.96 ± 55.74 | 34.06 ± 0.23 | 33.80 ± 1.07 |
| Arc 140V | llama 7B Q4_0 | SYCL | 99 | 1 | 519.44 ± 0.78 | 393.29 ± 17.66 | 11.69 ± 0.45 | 12.04 ± 0.94 |
| Arc 140V | phi3 3B Q4_0 | SYCL | 99 | 1 | 752.71 ± 21.10 | 787.60 ± 6.11 | 18.77 ± 0.04 | 18.85 ± 0.10 |
@Alcpz I fixed the known issues of the Q4_0 reorder optimization in https://github.com/ggml-org/llama.cpp/pull/13003. You could refer to it for this PR; the wrong-result issue could be fixed easily.
@Alcpz I find this PR still doesn't resolve the wrong-result issue with llama-7b Q4_0, and the performance is reduced too.
We hope this feature increases performance further, but in this test case it is reduced.
Tested with:
```sh
./examples/sycl/build.sh
./examples/sycl/run-llama2.sh
```
with the following settings:
```sh
GGML_SYCL_DISABLE_OPT=0
GGML_SYCL_PRIORITIZE_DMMV=0
```
Result:
```
Step 1: Get to know the basics of web design
Step 2: Set up a web hosting account
Step 3: Download a free website builder
Step 4: Set up a domain name
Step 5: Design your website
Step 6: Add content to your site
Step 7: Make the site responsive
Step 8: Add a contact form
Step 9: Add a social media share button
Step 10: Advertise your website
```
```
llama_perf_context_print: prompt eval time =   322.34 ms / 19 tokens (16.97 ms per token, 58.94 tokens per second)
llama_perf_context_print:        eval time = 10927.57 ms / 399 runs  (27.39 ms per token, 36.51 tokens per second)
```
But after enabling the new feature in this PR:
```sh
GGML_SYCL_DISABLE_OPT=0
GGML_SYCL_PRIORITIZE_DMMV=1
```
```
Step 1: Select the Website Name
Step 2: Select the Website Type
Step 3: Choose a Website Theme
Step 4: Choose a Website Name
Step 5: Choose a Website Theme
Step 6: Choose a Website Name
Step 7: Choose a Website Theme
Step 8: Choose a Website Name
Step 9: Choose a Website Theme
Step 10: Choose a Website Theme
```
```
llama_perf_context_print: prompt eval time =   294.09 ms / 19 tokens (15.48 ms per token, 64.61 tokens per second)
llama_perf_context_print:        eval time = 14253.18 ms / 399 runs  (35.72 ms per token, 27.99 tokens per second)
```
@Alcpz I suggest reverting this PR first; after the issue is fixed, it can be merged again.
@NeoZhangJianyu I recommend testing with llama 3 now. llama 2 is outdated and no one uses it.
I wonder why the llama2 result is wrong. The reorder feature only changes where the data is stored; the result of any LLM shouldn't be impacted.
llama2 is the design base of many other models, so I'm afraid other LLMs based on llama2 will be impacted too.
In this PR, I guess the root cause is that the code path changed, which impacts the result, but it looks like that is being overlooked.
> @NeoZhangJianyu I recommend testing with llama 3 now. llama 2 is outdated and no one uses it.
I tested with llama3-8b and the wrong result is present there too.
With qwen2-1.5b-instruct-q4_0.gguf there is no impact on the result.
I'll put a patch up that disables MMVQ for Q4_0 for now until I find out why this happens.
Edit: My point here is that I'm not reverting all this work if you find issues with specific models. If there are problems we should detect and fix them; our mul mat dispatching should be flexible enough to allow us to change these things with little effort.
@NeoZhangJianyu we've discussed this issue a bit more offline. So far we have not been able to reproduce the issue you mention, so we don't want to get blocked by it. To make the discussion clearer, could you open an issue in llama.cpp with the exact setup that you use (specific host and device HW and relevant versions of oneAPI, GPU drivers, etc.)? We are planning to share docker images of what we're using exactly. I think we need to better understand what the differences in our setups are. In the meantime we are planning to keep these changes as they are and progress with https://github.com/ggml-org/llama.cpp/pull/13109. It would also be useful to see if more users run into similar issues or if it is specific to your setup.
(edited: added a missing "not")
> I'll put a patch up that disables MMVQ for Q4_0 for now until I find out why this happens.
>
> Edit: My point here is that I'm not reverting all this work if you find issues with specific models. If there are problems we should detect and fix them; our mul mat dispatching should be flexible enough to allow us to change these things with little effort.
LGTM!
> @NeoZhangJianyu we've discussed this issue a bit more offline. So far we have not been able to reproduce the issue you mention, so we don't want to get blocked by it. To make the discussion clearer, could you open an issue in llama.cpp with the exact setup that you use (specific host and device HW and relevant versions of oneAPI, GPU drivers, etc.)? We are planning to share docker images of what we're using exactly. I think we need to better understand what the differences in our setups are. In the meantime we are planning to keep these changes as they are and progress with #13109. It would also be useful to see if more users run into similar issues or if it is specific to your setup.
My test case is very simple and easy to reproduce: Ubuntu 22.04, Arc 770, oneAPI 2025.0.
I don't think we need a separate issue to track this when we found it during the PR review. I added the environment variable GGML_SYCL_DISABLE_OPT to enable/disable the reorder feature so its effect can be tested.
While developing the reorder feature, I noticed the wrong-result issue even though the UT case passed: the UT allows an error of < 0.5, but real LLM inference needs a smaller error. I spent a lot of time handling this issue.
It looks like the inference result check is easily overlooked.