sycl: use oneDNN for matrix multiplication
Summary:
- SYCL now performs matrix multiplication with oneDNN instead of MKL
- The previous behavior (MKL instead of oneDNN) can be forced with a new runtime flag: GGML_SYCL_DISABLE_DNN
- Compiling with oneDNN can be explicitly disabled with the CMake flag GGML_SYCL_DNN (see the usage sketch below)
- Several assertions were added; they do not introduce any additional limitations, they only document limitations and assumptions that were already present in the code
- Some cleanups: less code, unused variables removed

This is the first change towards making SYCL graphs work, by avoiding calls that issue sycl::wait during the recording phase. MKL uses waits a lot while oneDNN does not. Using oneDNN should also be preferred over MKL for AI workloads, hence the change.
The runtime flag is meant to ease performance comparison between MKL and oneDNN and is a safeguard for customers in case someone finds oneDNN slower in a particular case.
The compile-time flag makes configuring SYCL with oneDNN cleaner.
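A minimal usage sketch of both flags, assuming they behave as described above (other SYCL build options, the build directory, and the model path are placeholders):

```sh
# Configure time: explicitly disable oneDNN for the SYCL backend
cmake -B build -DGGML_SYCL=ON -DGGML_SYCL_DNN=OFF
cmake --build build --config Release

# Runtime (with a oneDNN-enabled build): fall back to the previous MKL-based
# matrix multiplication; setting the variable to 1 is assumed to disable oneDNN
export GGML_SYCL_DISABLE_DNN=1
./build/bin/llama-cli -m model.gguf -ngl 99
```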
- Please make sure the CI/UT passes with this PR.
- There are several assert() calls to be added. Why add them? The new code should support the same cases as the legacy code.
> Please make sure the CI/UT passes with this PR.

Sure.

> There are several assert() calls to be added. Why add them? The new code should support the same cases as the legacy code.

I've updated the description of this PR to explain this. There are no new limitations added; the new assertions just document already existing assumptions or dependencies in the code. Such an assertion is self-verifying documentation of the code.
@lslusarczyk Mind rebasing on top of master? I'm seeing very bad performance in this PR, but it seems to be related to not having #13343 in this branch. A previous PR disabled non-contiguous src1 mul_mats due to some bugs, which translates to very poor performance in Prompt Processing. Numbers are not reliable until you merge it. (It's caused by this: https://github.com/ggml-org/llama.cpp/blob/ac04335e97acde1e8884d66de73ab031dc0d6519/ggml/src/ggml-sycl/ggml-sycl.cpp#L3935-L3937)
Yes. I've spent the last few days resolving conflicts and fixing test cases that started to fail after rebasing onto recent changes. Everything is rebased and works for me locally now. I will test performance on my side again, and if all looks good I will ask you here to test perf again.
Here's the performance I got compared to master using FP32:
| model | device | test | t/s bb1681fb (5376, master) | t/s 53113d0a (5352, this PR) | speedup |
|---|---|---|---|---|---|
| qwen2 1.5B Q4_0 | Data Max 1100 | pp512 | 5153.25 | 5176.24 | 1.0 |
| qwen2 1.5B Q4_0 | Data Max 1100 | tg128 | 146.65 | 146.3 | 1.0 |
| llama 7B Q4_0 | Data Max 1100 | pp512 | 1170.74 | 1170.12 | 1.0 |
| llama 7B Q4_0 | Data Max 1100 | tg128 | 71.48 | 72.16 | 1.01 |
| phi3 3B Q4_0 | Data Max 1100 | pp512 | 1909.14 | 1906.37 | 1.0 |
| phi3 3B Q4_0 | Data Max 1100 | tg128 | 109.66 | 111.03 | 1.01 |
| qwen2 1.5B Q4_0 | Arc B580 | pp512 | 3298.62 | 3327.74 | 1.01 |
| qwen2 1.5B Q4_0 | Arc B580 | tg128 | 79.35 | 81.92 | 1.03 |
| llama 7B Q4_0 | Arc B580 | pp512 | 781.32 | 781.11 | 1.0 |
| llama 7B Q4_0 | Arc B580 | tg128 | 61.99 | 63.41 | 1.02 |
| phi3 3B Q4_0 | Arc B580 | pp512 | 1259.04 | 1259.22 | 1.0 |
| phi3 3B Q4_0 | Arc B580 | tg128 | 80.08 | 79.38 | 0.99 |
| qwen2 1.5B Q4_0 | Arc A770 | pp512 | 3246.14 | 3268.02 | 1.01 |
| qwen2 1.5B Q4_0 | Arc A770 | tg128 | 44.78 | 44.77 | 1.0 |
| llama 7B Q4_0 | Arc A770 | pp512 | 790.7 | 789.74 | 1.0 |
| llama 7B Q4_0 | Arc A770 | tg128 | 33.87 | 33.79 | 1.0 |
| phi3 3B Q4_0 | Arc A770 | pp512 | 1257.26 | 1257.94 | 1.0 |
| phi3 3B Q4_0 | Arc A770 | tg128 | 39.4 | 39.53 | 1.0 |
| qwen2 1.5B Q4_0 | Arc V140 | pp512 | 886.79 | 884.95 | 1.0 |
| qwen2 1.5B Q4_0 | Arc V140 | tg128 | 47.65 | 47.91 | 1.01 |
| llama 7B Q4_0 | Arc V140 | pp512 | 207.99 | 208.04 | 1.0 |
| llama 7B Q4_0 | Arc V140 | tg128 | 20.91 | 20.87 | 1.0 |
| phi3 3B Q4_0 | Arc V140 | pp512 | 334.03 | 333.69 | 1.0 |
| phi3 3B Q4_0 | Arc V140 | tg128 | 31.26 | 31.11 | 1.0 |
Speedup is calculated by dividing b5352 by b5376 (this PR / master).
I'm re-running with FP16 and some more models / quantizations, but so far it looks good.
Results for FP16:
So far the results are roughly equivalent, even better in some cases. There are a couple of cases that may require a closer look, like qwen2 1.5B Q4_K - Medium on Battlemage during TG and gemma2 2B Q4_K - Medium during prompt processing.
Benchmark Results
| model | device | test | t/s bb1681fb (5376, master) | t/s 53113d0a (5352, this PR) | speedup |
|---|---|---|---|---|---|
| qwen2 1.5B Q4_0 | Data Max 1100 | pp512 | 6981.83 | 7050.21 | 1.01 |
| qwen2 1.5B Q4_0 | Data Max 1100 | tg128 | 143.85 | 144.1 | 1 |
| qwen2 1.5B Q4_K - Medium | Data Max 1100 | pp512 | 10254.4 | 10172.7 | 0.99 |
| qwen2 1.5B Q4_K - Medium | Data Max 1100 | tg128 | 107.35 | 107.43 | 1 |
| llama 7B Q4_0 | Data Max 1100 | pp512 | 1751.95 | 1751.15 | 1 |
| llama 7B Q4_0 | Data Max 1100 | tg128 | 71.06 | 71.28 | 1 |
| llama 7B Q4_K - Medium | Data Max 1100 | pp512 | 3369.32 | 3378.41 | 1 |
| llama 7B Q4_K - Medium | Data Max 1100 | tg128 | 38.98 | 38.99 | 1 |
| gemma2 2B Q4_K - Medium | Data Max 1100 | pp512 | 7825.8 | 7798.84 | 1 |
| gemma2 2B Q4_K - Medium | Data Max 1100 | tg128 | 82.58 | 82.4 | 1 |
| phi3 3B Q4_0 | Data Max 1100 | pp512 | 2818.08 | 2825.46 | 1 |
| phi3 3B Q4_0 | Data Max 1100 | tg128 | 108.64 | 109.39 | 1.01 |
| phi3 3B Q4_K - Medium | Data Max 1100 | pp512 | 4505.99 | 4626.3 | 1.03 |
| phi3 3B Q4_K - Medium | Data Max 1100 | tg128 | 65.41 | 65.43 | 1 |
| llama 34B Q6_K | Data Max 1100 | pp512 | 1755.48 | 1763.75 | 1 |
| llama 34B Q6_K | Data Max 1100 | tg128 | 23.53 | 23.3 | 0.99 |
| llama 8B Q4_K - Medium | Data Max 1100 | pp512 | 3383.7 | 3338.83 | 0.99 |
| llama 8B Q4_K - Medium | Data Max 1100 | tg128 | 36.26 | 35.93 | 0.99 |
| qwen2 1.5B Q4_0 | Arc B580 | pp512 | 5959.04 | 6052.54 | 1.02 |
| qwen2 1.5B Q4_0 | Arc B580 | tg128 | 138.28 | 138.52 | 1 |
| qwen2 1.5B Q4_K - Medium | Arc B580 | pp512 | 7564.61 | 7502.2 | 0.99 |
| qwen2 1.5B Q4_K - Medium | Arc B580 | tg128 | 106.54 | 100.88 | 0.95 |
| llama 7B Q4_0 | Arc B580 | pp512 | 1624.43 | 1624.43 | 1 |
| llama 7B Q4_0 | Arc B580 | tg128 | 65.43 | 65.09 | 0.99 |
| llama 7B Q4_K - Medium | Arc B580 | pp512 | 2206.23 | 2201.93 | 1 |
| llama 7B Q4_K - Medium | Arc B580 | tg128 | 38.3 | 38.2 | 1 |
| gemma2 2B Q4_K - Medium | Arc B580 | pp512 | 5752.78 | 5722.25 | 0.99 |
| gemma2 2B Q4_K - Medium | Arc B580 | tg128 | 75.26 | 72.35 | 0.96 |
| phi3 3B Q4_0 | Arc B580 | pp512 | 2410.01 | 2408.05 | 1 |
| phi3 3B Q4_0 | Arc B580 | tg128 | 100.07 | 95.97 | 0.96 |
| phi3 3B Q4_K - Medium | Arc B580 | pp512 | 3167.63 | 3165.56 | 1 |
| phi3 3B Q4_K - Medium | Arc B580 | tg128 | 55.43 | 54.65 | 0.99 |
| llama 34B Q6_K | Arc B580 | pp512 | 1486.23 | 1483.83 | 1 |
| llama 34B Q6_K | Arc B580 | tg128 | 20.43 | 20.27 | 0.99 |
| llama 8B Q4_K - Medium | Arc B580 | pp512 | 2075.02 | 2067.28 | 1 |
| llama 8B Q4_K - Medium | Arc B580 | tg128 | 34.68 | 34.06 | 0.98 |
| qwen2 1.5B Q4_0 | Arc A770 | pp512 | 4082.11 | 4071.16 | 1 |
| qwen2 1.5B Q4_0 | Arc A770 | tg128 | 44.44 | 44.5 | 1 |
| qwen2 1.5B Q4_K - Medium | Arc A770 | pp512 | 4460.65 | 4458.04 | 1 |
| qwen2 1.5B Q4_K - Medium | Arc A770 | tg128 | 43.85 | 44.37 | 1.01 |
| llama 7B Q4_0 | Arc A770 | pp512 | 1468.23 | 1468.99 | 1 |
| llama 7B Q4_0 | Arc A770 | tg128 | 33.55 | 33.42 | 1 |
| llama 7B Q4_K - Medium | Arc A770 | pp512 | 1727.55 | 1727.02 | 1 |
| llama 7B Q4_K - Medium | Arc A770 | tg128 | 25.85 | 25.72 | 0.99 |
| gemma2 2B Q4_K - Medium | Arc A770 | pp512 | 3724.92 | 3623.25 | 0.97 |
| gemma2 2B Q4_K - Medium | Arc A770 | tg128 | 35.11 | 35.06 | 1 |
| phi3 3B Q4_0 | Arc A770 | pp512 | 2163.5 | 2165.57 | 1 |
| phi3 3B Q4_0 | Arc A770 | tg128 | 37.23 | 37.48 | 1.01 |
| phi3 3B Q4_K - Medium | Arc A770 | pp512 | 2502.81 | 2499.9 | 1 |
| phi3 3B Q4_K - Medium | Arc A770 | tg128 | 29.86 | 29.91 | 1 |
| llama 34B Q6_K | Arc A770 | pp512 | 1063.01 | 1029.61 | 0.97 |
| llama 34B Q6_K | Arc A770 | tg128 | 15.42 | 14.95 | 0.97 |
| llama 8B Q4_K - Medium | Arc A770 | pp512 | 1676.83 | 1620.34 | 0.97 |
| llama 8B Q4_K - Medium | Arc A770 | tg128 | 23.19 | 22.64 | 0.98 |
| qwen2 1.5B Q4_0 | Arc V140 | pp512 | 1100.25 | 1086.64 | 0.99 |
| qwen2 1.5B Q4_0 | Arc V140 | tg128 | 47.9 | 47.55 | 0.99 |
| qwen2 1.5B Q4_K - Medium | Arc V140 | pp512 | 1490 | 1512.93 | 1.02 |
| qwen2 1.5B Q4_K - Medium | Arc V140 | tg128 | 37.24 | 37.26 | 1 |
| llama 7B Q4_0 | Arc V140 | pp512 | 323.82 | 323.29 | 1 |
| llama 7B Q4_0 | Arc V140 | tg128 | 20.49 | 20.94 | 1.02 |
| llama 7B Q4_K - Medium | Arc V140 | pp512 | 548.41 | 549.69 | 1 |
| llama 7B Q4_K - Medium | Arc V140 | tg128 | 13.35 | 13.37 | 1 |
| gemma2 2B Q4_K - Medium | Arc V140 | pp512 | 647.98 | 594.44 | 0.92 |
| gemma2 2B Q4_K - Medium | Arc V140 | tg128 | 25.2 | 25.2 | 1 |
| phi3 3B Q4_0 | Arc V140 | pp512 | 517.41 | 517.47 | 1 |
| phi3 3B Q4_0 | Arc V140 | tg128 | 31.41 | 31.41 | 1 |
| phi3 3B Q4_K - Medium | Arc V140 | pp512 | 841.13 | 843.35 | 1 |
| phi3 3B Q4_K - Medium | Arc V140 | tg128 | 21.19 | 21.21 | 1 |
| llama 34B Q6_K | Arc V140 | pp512 | 368.28 | 367.2 | 1 |
| llama 34B Q6_K | Arc V140 | tg128 | 7.34 | 7.33 | 1 |
| llama 8B Q4_K - Medium | Arc V140 | pp512 | 507.21 | 504.61 | 0.99 |
| llama 8B Q4_K - Medium | Arc V140 | tg128 | 11.77 | 11.73 | 1 |
Edit:
llama-bench command:

```sh
export GGML_SYCL_PRIORITIZE_DMMV=0
export GGML_SYCL_DISABLE_OPT=0
${LLAMA_BUILD_DIR}/bin/llama-bench \
  -m /home/shared/llama.cpp-models/DeepSeek-R1-Distill-Qwen-1.5B-Q4_0.gguf \
  -m /home/shared/llama.cpp-models/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/emma-500-llama2-7b-Q4_0.gguf \
  -m /home/shared/llama.cpp-models/emma-500-llama2-7b-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/gemma-2-2b-it-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/Phi-3.5-mini-instruct-Q4_0.gguf \
  -m /home/shared/llama.cpp-models/Phi-3.5-mini-instruct-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/solar-10.7b-instruct-v1.0.Q6_K.gguf \
  -m /home/shared/llama.cpp-models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -p 512 \
  -n 128 \
  -pg 0,0 \
  -mmp 0 \
  -t 8 \
  -r 5 \
  -sm none \
  -ngl 99 \
  -o md
```
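For a direct oneDNN vs. MKL comparison on the same oneDNN-enabled build, the runtime flag from the PR description could presumably be set before re-running the same command; a hypothetical sketch using one of the models above:

```sh
# Assumption: GGML_SYCL_DISABLE_DNN=1 falls back to the MKL path (see the PR summary)
export GGML_SYCL_DISABLE_DNN=1
${LLAMA_BUILD_DIR}/bin/llama-bench \
  -m /home/shared/llama.cpp-models/emma-500-llama2-7b-Q4_0.gguf \
  -p 512 -n 128 -pg 0,0 -mmp 0 -t 8 -r 5 -sm none -ngl 99 -o md
```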
All tests have passed. @Alcpz, @Rbiessy, could you please merge?