
sycl: use oneDNN for matrix multiplication

Open · lslusarczyk opened this pull request 9 months ago · 2 comments

Summary:

  • SYCL now performs matrix multiplication with oneDNN instead of MKL
  • the previous behavior (using MKL instead of oneDNN) can be forced with a new runtime flag: GGML_SYCL_DISABLE_DNN
  • compilation with oneDNN can be explicitly disabled with the cmake flag GGML_SYCL_DNN
  • several assertions were added, but they do not introduce any additional limitations; they only document limitations and assumptions that were already present in the code
  • some cleanups: less code, unused variables removed

This is the first change towards making SYCL graphs work, by avoiding calls that issue sycl::wait inside the recording phase. MKL uses waits a lot, while oneDNN does not. Using oneDNN should also be preferred over MKL for AI workloads, hence the change.
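
For illustration, a minimal standalone sketch of a oneDNN matmul call is shown below (this is not the PR's code; the shapes, f32 data type, and GPU engine index are illustrative assumptions). The property that matters for graph recording is that the primitive is only enqueued on the stream, and synchronization is left to the caller.

```cpp
// Minimal oneDNN matmul sketch (illustrative only, not the implementation in this PR).
#include <dnnl.hpp>

int main() {
    using namespace dnnl;

    engine eng(engine::kind::gpu, 0);            // assumes a GPU engine is available
    stream strm(eng);

    const memory::dim M = 64, K = 128, N = 32;   // illustrative shapes
    memory::desc a_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc b_md({K, N}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc c_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    memory a_mem(a_md, eng), b_mem(b_md, eng), c_mem(c_md, eng);

    matmul::primitive_desc pd(eng, a_md, b_md, c_md);
    matmul prim(pd);

    // The primitive is enqueued on the stream; no implicit wait is issued here.
    prim.execute(strm, {{DNNL_ARG_SRC, a_mem},
                        {DNNL_ARG_WEIGHTS, b_mem},
                        {DNNL_ARG_DST, c_mem}});

    strm.wait();                                 // synchronization is the caller's choice
    return 0;
}
```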

The runtime flag is meant to ease performance comparisons between MKL and oneDNN, and acts as a safeguard for customers in case oneDNN turns out to be slower in a particular case.

The compile-time flag makes configuring SYCL with oneDNN cleaner.

lslusarczyk · Apr 16 '25

  1. Please make sure the CI/UT passes with this PR.
  2. Several assert() calls are added. Why add them? The new code should support the same cases as the legacy code.

NeoZhangJianyu · Apr 21 '25

  1. Please make sure the CI/UT passes with this PR.

sure

  1. Several assert() calls are added. Why add them? The new code should support the same cases as the legacy code.

I've updated the description of this PR to explain this. No new limitations are added. The new assertions just document assumptions and dependencies that already existed in the code; such an assertion is self-verifying documentation of the code.
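
To illustrate (a hypothetical sketch, not code from this PR; the function and parameter names are invented), such an assertion only makes an assumption the surrounding code already relied on fail loudly instead of silently:

```cpp
// Hypothetical example of an assertion used as self-verifying documentation.
#include <cassert>
#include <cstddef>

static void scale_rows_f32(float * data, size_t rows, size_t cols, size_t stride_elems) {
    // The existing code already assumed a dense row-major layout; the assert adds
    // no new limitation, it only documents and checks that assumption.
    assert(data != nullptr);
    assert(stride_elems == cols && "rows must be contiguous, as the previous code assumed");

    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c < cols; ++c) {
            data[r * stride_elems + c] *= 0.5f;   // placeholder for the real work
        }
    }
}
```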

lslusarczyk · May 07 '25

@lslusarczyk Mind rebasing on top of master? I'm seeing very bad performance in this PR, but it seems to be related to not having #13343 in this branch. A previous PR disabled non-contiguous src1 mul_mats due to some bugs, which translates to very poor performance in Prompt Processing. Numbers are not reliable until you merge it. (It's caused by this: https://github.com/ggml-org/llama.cpp/blob/ac04335e97acde1e8884d66de73ab031dc0d6519/ggml/src/ggml-sycl/ggml-sycl.cpp#L3935-L3937)
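
(For context, a hypothetical illustration of the kind of guard being described is sketched below; this is not the code at the linked lines. Only ggml_is_contiguous() and the ggml_tensor src array are real ggml API, the function name is invented. Rejecting non-contiguous src1 pushes those mul_mats off the fast path, which is what shows up as poor Prompt Processing throughput.)

```cpp
// Hypothetical sketch of a "non-contiguous src1 is not supported" guard.
#include "ggml.h"

static bool sycl_supports_mul_mat(const struct ggml_tensor * op) {
    const struct ggml_tensor * src1 = op->src[1];
    // When src1 is not contiguous the operation is rejected here and handled
    // by a slower fallback path, hurting prompt-processing throughput.
    return ggml_is_contiguous(src1);
}
```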

Alcpz · May 12 '25

@lslusarczyk Mind rebasing on top of master? I'm seeing very bad performance in this PR, but it seems to be related to not having #13343 in this branch. A previous PR disabled non-contiguous src1 mul_mats due to some bugs, which translates to very poor performance in Prompt Processing. Numbers are not reliable until you merge it. (It's caused by this:

https://github.com/ggml-org/llama.cpp/blob/ac04335e97acde1e8884d66de73ab031dc0d6519/ggml/src/ggml-sycl/ggml-sycl.cpp#L3935-L3937 )

Yes. I've spent the last few days resolving conflicts and fixing test cases that started to fail after rebasing onto the recent changes. Everything is rebased and works for me locally now. I will test performance on my own again, and if all looks good I will ask you here to test perf again.

lslusarczyk · May 12 '25

Here's the performance I got compared to master, using FP32:

| model | device | test | t/s bb1681fb (5376, master) | t/s 53113d0a (5352, this PR) | speedup |
| --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | Data Max 1100 | pp512 | 5153.25 | 5176.24 | 1.0 |
| qwen2 1.5B Q4_0 | Data Max 1100 | tg128 | 146.65 | 146.3 | 1.0 |
| llama 7B Q4_0 | Data Max 1100 | pp512 | 1170.74 | 1170.12 | 1.0 |
| llama 7B Q4_0 | Data Max 1100 | tg128 | 71.48 | 72.16 | 1.01 |
| phi3 3B Q4_0 | Data Max 1100 | pp512 | 1909.14 | 1906.37 | 1.0 |
| phi3 3B Q4_0 | Data Max 1100 | tg128 | 109.66 | 111.03 | 1.01 |
| qwen2 1.5B Q4_0 | Arc B580 | pp512 | 3298.62 | 3327.74 | 1.01 |
| qwen2 1.5B Q4_0 | Arc B580 | tg128 | 79.35 | 81.92 | 1.03 |
| llama 7B Q4_0 | Arc B580 | pp512 | 781.32 | 781.11 | 1.0 |
| llama 7B Q4_0 | Arc B580 | tg128 | 61.99 | 63.41 | 1.02 |
| phi3 3B Q4_0 | Arc B580 | pp512 | 1259.04 | 1259.22 | 1.0 |
| phi3 3B Q4_0 | Arc B580 | tg128 | 80.08 | 79.38 | 0.99 |
| qwen2 1.5B Q4_0 | Arc A770 | pp512 | 3246.14 | 3268.02 | 1.01 |
| qwen2 1.5B Q4_0 | Arc A770 | tg128 | 44.78 | 44.77 | 1.0 |
| llama 7B Q4_0 | Arc A770 | pp512 | 790.7 | 789.74 | 1.0 |
| llama 7B Q4_0 | Arc A770 | tg128 | 33.87 | 33.79 | 1.0 |
| phi3 3B Q4_0 | Arc A770 | pp512 | 1257.26 | 1257.94 | 1.0 |
| phi3 3B Q4_0 | Arc A770 | tg128 | 39.4 | 39.53 | 1.0 |
| qwen2 1.5B Q4_0 | Arc V140 | pp512 | 886.79 | 884.95 | 1.0 |
| qwen2 1.5B Q4_0 | Arc V140 | tg128 | 47.65 | 47.91 | 1.01 |
| llama 7B Q4_0 | Arc V140 | pp512 | 207.99 | 208.04 | 1.0 |
| llama 7B Q4_0 | Arc V140 | tg128 | 20.91 | 20.87 | 1.0 |
| phi3 3B Q4_0 | Arc V140 | pp512 | 334.03 | 333.69 | 1.0 |
| phi3 3B Q4_0 | Arc V140 | tg128 | 31.26 | 31.11 | 1.0 |

Speedup is calculated as b5352 / b5376 (this PR / master).

I'm re-running with FP16 and some more models / quantizations, but so far it looks good.

Alcpz · May 14 '25

Results for FP16:

So far, results are roughly equivalent, and even better in some cases. A couple of cases may need a closer look, like qwen2 1.5B Q4_K - Medium on Battlemage during TG, and gemma2 2B Q4_K - Medium during prompt processing.

Benchmark Results
| model | device | test | t/s bb1681fb (5376, master) | t/s 53113d0a (5352, this PR) | speedup |
| --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | Data Max 1100 | pp512 | 6981.83 | 7050.21 | 1.01 |
| qwen2 1.5B Q4_0 | Data Max 1100 | tg128 | 143.85 | 144.1 | 1 |
| qwen2 1.5B Q4_K - Medium | Data Max 1100 | pp512 | 10254.4 | 10172.7 | 0.99 |
| qwen2 1.5B Q4_K - Medium | Data Max 1100 | tg128 | 107.35 | 107.43 | 1 |
| llama 7B Q4_0 | Data Max 1100 | pp512 | 1751.95 | 1751.15 | 1 |
| llama 7B Q4_0 | Data Max 1100 | tg128 | 71.06 | 71.28 | 1 |
| llama 7B Q4_K - Medium | Data Max 1100 | pp512 | 3369.32 | 3378.41 | 1 |
| llama 7B Q4_K - Medium | Data Max 1100 | tg128 | 38.98 | 38.99 | 1 |
| gemma2 2B Q4_K - Medium | Data Max 1100 | pp512 | 7825.8 | 7798.84 | 1 |
| gemma2 2B Q4_K - Medium | Data Max 1100 | tg128 | 82.58 | 82.4 | 1 |
| phi3 3B Q4_0 | Data Max 1100 | pp512 | 2818.08 | 2825.46 | 1 |
| phi3 3B Q4_0 | Data Max 1100 | tg128 | 108.64 | 109.39 | 1.01 |
| phi3 3B Q4_K - Medium | Data Max 1100 | pp512 | 4505.99 | 4626.3 | 1.03 |
| phi3 3B Q4_K - Medium | Data Max 1100 | tg128 | 65.41 | 65.43 | 1 |
| llama 34B Q6_K | Data Max 1100 | pp512 | 1755.48 | 1763.75 | 1 |
| llama 34B Q6_K | Data Max 1100 | tg128 | 23.53 | 23.3 | 0.99 |
| llama 8B Q4_K - Medium | Data Max 1100 | pp512 | 3383.7 | 3338.83 | 0.99 |
| llama 8B Q4_K - Medium | Data Max 1100 | tg128 | 36.26 | 35.93 | 0.99 |
| qwen2 1.5B Q4_0 | Arc B580 | pp512 | 5959.04 | 6052.54 | 1.02 |
| qwen2 1.5B Q4_0 | Arc B580 | tg128 | 138.28 | 138.52 | 1 |
| qwen2 1.5B Q4_K - Medium | Arc B580 | pp512 | 7564.61 | 7502.2 | 0.99 |
| qwen2 1.5B Q4_K - Medium | Arc B580 | tg128 | 106.54 | 100.88 | 0.95 |
| llama 7B Q4_0 | Arc B580 | pp512 | 1624.43 | 1624.43 | 1 |
| llama 7B Q4_0 | Arc B580 | tg128 | 65.43 | 65.09 | 0.99 |
| llama 7B Q4_K - Medium | Arc B580 | pp512 | 2206.23 | 2201.93 | 1 |
| llama 7B Q4_K - Medium | Arc B580 | tg128 | 38.3 | 38.2 | 1 |
| gemma2 2B Q4_K - Medium | Arc B580 | pp512 | 5752.78 | 5722.25 | 0.99 |
| gemma2 2B Q4_K - Medium | Arc B580 | tg128 | 75.26 | 72.35 | 0.96 |
| phi3 3B Q4_0 | Arc B580 | pp512 | 2410.01 | 2408.05 | 1 |
| phi3 3B Q4_0 | Arc B580 | tg128 | 100.07 | 95.97 | 0.96 |
| phi3 3B Q4_K - Medium | Arc B580 | pp512 | 3167.63 | 3165.56 | 1 |
| phi3 3B Q4_K - Medium | Arc B580 | tg128 | 55.43 | 54.65 | 0.99 |
| llama 34B Q6_K | Arc B580 | pp512 | 1486.23 | 1483.83 | 1 |
| llama 34B Q6_K | Arc B580 | tg128 | 20.43 | 20.27 | 0.99 |
| llama 8B Q4_K - Medium | Arc B580 | pp512 | 2075.02 | 2067.28 | 1 |
| llama 8B Q4_K - Medium | Arc B580 | tg128 | 34.68 | 34.06 | 0.98 |
| qwen2 1.5B Q4_0 | Arc A770 | pp512 | 4082.11 | 4071.16 | 1 |
| qwen2 1.5B Q4_0 | Arc A770 | tg128 | 44.44 | 44.5 | 1 |
| qwen2 1.5B Q4_K - Medium | Arc A770 | pp512 | 4460.65 | 4458.04 | 1 |
| qwen2 1.5B Q4_K - Medium | Arc A770 | tg128 | 43.85 | 44.37 | 1.01 |
| llama 7B Q4_0 | Arc A770 | pp512 | 1468.23 | 1468.99 | 1 |
| llama 7B Q4_0 | Arc A770 | tg128 | 33.55 | 33.42 | 1 |
| llama 7B Q4_K - Medium | Arc A770 | pp512 | 1727.55 | 1727.02 | 1 |
| llama 7B Q4_K - Medium | Arc A770 | tg128 | 25.85 | 25.72 | 0.99 |
| gemma2 2B Q4_K - Medium | Arc A770 | pp512 | 3724.92 | 3623.25 | 0.97 |
| gemma2 2B Q4_K - Medium | Arc A770 | tg128 | 35.11 | 35.06 | 1 |
| phi3 3B Q4_0 | Arc A770 | pp512 | 2163.5 | 2165.57 | 1 |
| phi3 3B Q4_0 | Arc A770 | tg128 | 37.23 | 37.48 | 1.01 |
| phi3 3B Q4_K - Medium | Arc A770 | pp512 | 2502.81 | 2499.9 | 1 |
| phi3 3B Q4_K - Medium | Arc A770 | tg128 | 29.86 | 29.91 | 1 |
| llama 34B Q6_K | Arc A770 | pp512 | 1063.01 | 1029.61 | 0.97 |
| llama 34B Q6_K | Arc A770 | tg128 | 15.42 | 14.95 | 0.97 |
| llama 8B Q4_K - Medium | Arc A770 | pp512 | 1676.83 | 1620.34 | 0.97 |
| llama 8B Q4_K - Medium | Arc A770 | tg128 | 23.19 | 22.64 | 0.98 |
| qwen2 1.5B Q4_0 | Arc V140 | pp512 | 1100.25 | 1086.64 | 0.99 |
| qwen2 1.5B Q4_0 | Arc V140 | tg128 | 47.9 | 47.55 | 0.99 |
| qwen2 1.5B Q4_K - Medium | Arc V140 | pp512 | 1490 | 1512.93 | 1.02 |
| qwen2 1.5B Q4_K - Medium | Arc V140 | tg128 | 37.24 | 37.26 | 1 |
| llama 7B Q4_0 | Arc V140 | pp512 | 323.82 | 323.29 | 1 |
| llama 7B Q4_0 | Arc V140 | tg128 | 20.49 | 20.94 | 1.02 |
| llama 7B Q4_K - Medium | Arc V140 | pp512 | 548.41 | 549.69 | 1 |
| llama 7B Q4_K - Medium | Arc V140 | tg128 | 13.35 | 13.37 | 1 |
| gemma2 2B Q4_K - Medium | Arc V140 | pp512 | 647.98 | 594.44 | 0.92 |
| gemma2 2B Q4_K - Medium | Arc V140 | tg128 | 25.2 | 25.2 | 1 |
| phi3 3B Q4_0 | Arc V140 | pp512 | 517.41 | 517.47 | 1 |
| phi3 3B Q4_0 | Arc V140 | tg128 | 31.41 | 31.41 | 1 |
| phi3 3B Q4_K - Medium | Arc V140 | pp512 | 841.13 | 843.35 | 1 |
| phi3 3B Q4_K - Medium | Arc V140 | tg128 | 21.19 | 21.21 | 1 |
| llama 34B Q6_K | Arc V140 | pp512 | 368.28 | 367.2 | 1 |
| llama 34B Q6_K | Arc V140 | tg128 | 7.34 | 7.33 | 1 |
| llama 8B Q4_K - Medium | Arc V140 | pp512 | 507.21 | 504.61 | 0.99 |
| llama 8B Q4_K - Medium | Arc V140 | tg128 | 11.77 | 11.73 | 1 |

Edit:

llama-bench command:
export GGML_SYCL_PRIORITIZE_DMMV=0
export GGML_SYCL_DISABLE_OPT=0
${LLAMA_BUILD_DIR}/bin/llama-bench \
  -m /home/shared/llama.cpp-models/DeepSeek-R1-Distill-Qwen-1.5B-Q4_0.gguf \
  -m /home/shared/llama.cpp-models/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/emma-500-llama2-7b-Q4_0.gguf \
  -m /home/shared/llama.cpp-models/emma-500-llama2-7b-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/gemma-2-2b-it-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/Phi-3.5-mini-instruct-Q4_0.gguf \
  -m /home/shared/llama.cpp-models/Phi-3.5-mini-instruct-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/solar-10.7b-instruct-v1.0.Q6_K.gguf \
  -m /home/shared/llama.cpp-models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -p 512 \
  -n 128 \
  -pg 0,0 \
  -mmp 0 \
  -t 8 \
  -r 5 \
  -sm none \
  -ngl 99 \
  -o md

Alcpz · May 14 '25

All tests have passed. @Alcpz, @Rbiessy, could you please merge?

lslusarczyk · May 15 '25