
sycl: use oneDNN for matrix multiplication

Open · lslusarczyk opened this pull request 9 months ago · 2 comments

Summary:

  • SYCL now performs matrix multiplication with oneDNN instead of MKL
  • the previous behavior (using MKL instead of oneDNN) can be forced with a new runtime flag: GGML_SYCL_DISABLE_DNN
  • compilation with oneDNN can be explicitly disabled with the cmake flag GGML_SYCL_DNN
  • several assertions were added, but they do not introduce any additional limitations; they only document limitations and assumptions that were already present in the code
  • some cleanups: less code, unused variables removed

This is the first change towards making SYCL graphs work, by avoiding calls that issue sycl::wait inside the recording phase. MKL uses waits a lot, while oneDNN does not. Using oneDNN should also be preferred over MKL for AI workloads, hence the change.
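
For illustration, a minimal standalone sketch of a oneDNN matmul call is shown below (this is not the PR's code; the shapes, f32 data type, and GPU engine index are illustrative assumptions). The property that matters for graph recording is that the primitive is only enqueued on the stream, and synchronization is left to the caller.

```cpp
// Minimal oneDNN matmul sketch (illustrative only, not the implementation in this PR).
#include <dnnl.hpp>

int main() {
    using namespace dnnl;

    engine eng(engine::kind::gpu, 0);            // assumes a GPU engine is available
    stream strm(eng);

    const memory::dim M = 64, K = 128, N = 32;   // illustrative shapes
    memory::desc a_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc b_md({K, N}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc c_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    memory a_mem(a_md, eng), b_mem(b_md, eng), c_mem(c_md, eng);

    matmul::primitive_desc pd(eng, a_md, b_md, c_md);
    matmul prim(pd);

    // The primitive is enqueued on the stream; no implicit wait is issued here.
    prim.execute(strm, {{DNNL_ARG_SRC, a_mem},
                        {DNNL_ARG_WEIGHTS, b_mem},
                        {DNNL_ARG_DST, c_mem}});

    strm.wait();                                 // synchronization is the caller's choice
    return 0;
}
```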

The runtime flag is meant to ease performance comparisons between MKL and oneDNN, and acts as a safeguard for customers in case oneDNN turns out to be slower in a particular case.

The compile-time flag makes configuring SYCL with oneDNN cleaner.

lslusarczyk · Apr 16 '25

  1. Please make sure the CI/UT passes with this PR.
  2. Several assert() calls are added. Why add them? The new code should support the same cases as the legacy code.

NeoZhangJianyu · Apr 21 '25

  1. Please make sure the CI/UT passes with this PR.

sure

  1. Several assert() calls are added. Why add them? The new code should support the same cases as the legacy code.

I've updated the description of this PR to explain this. No new limitations are added. The new assertions just document assumptions and dependencies that already existed in the code; such an assertion is self-verifying documentation of the code.
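
To illustrate (a hypothetical sketch, not code from this PR; the function and parameter names are invented), such an assertion only makes an assumption the surrounding code already relied on fail loudly instead of silently:

```cpp
// Hypothetical example of an assertion used as self-verifying documentation.
#include <cassert>
#include <cstddef>

static void scale_rows_f32(float * data, size_t rows, size_t cols, size_t stride_elems) {
    // The existing code already assumed a dense row-major layout; the assert adds
    // no new limitation, it only documents and checks that assumption.
    assert(data != nullptr);
    assert(stride_elems == cols && "rows must be contiguous, as the previous code assumed");

    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c < cols; ++c) {
            data[r * stride_elems + c] *= 0.5f;   // placeholder for the real work
        }
    }
}
```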

lslusarczyk · May 07 '25

@lslusarczyk Mind rebasing on top of master? I'm seeing very bad performance in this PR, but it seems to be related to not having #13343 in this branch. A previous PR disabled non-contiguous src1 mul_mats due to some bugs, which translates to very poor performance in Prompt Processing. Numbers are not reliable until you merge it. (It's caused by this: https://github.com/ggml-org/llama.cpp/blob/ac04335e97acde1e8884d66de73ab031dc0d6519/ggml/src/ggml-sycl/ggml-sycl.cpp#L3935-L3937)
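
(For context, a hypothetical illustration of the kind of guard being described is sketched below; this is not the code at the linked lines. Only ggml_is_contiguous() and the ggml_tensor src array are real ggml API, the function name is invented. Rejecting non-contiguous src1 pushes those mul_mats off the fast path, which is what shows up as poor Prompt Processing throughput.)

```cpp
// Hypothetical sketch of a "non-contiguous src1 is not supported" guard.
#include "ggml.h"

static bool sycl_supports_mul_mat(const struct ggml_tensor * op) {
    const struct ggml_tensor * src1 = op->src[1];
    // When src1 is not contiguous the operation is rejected here and handled
    // by a slower fallback path, hurting prompt-processing throughput.
    return ggml_is_contiguous(src1);
}
```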

Alcpz · May 12 '25

@lslusarczyk Mind rebasing on top of master? I'm seeing very bad performance in this PR, but it seems to be related to not having #13343 in this branch. A previous PR disabled non-contiguous src1 mul_mats due to some bugs, which translates to very poor performance in Prompt Processing. Numbers are not reliable until you merge it. (It's caused by this:

https://github.com/ggml-org/llama.cpp/blob/ac04335e97acde1e8884d66de73ab031dc0d6519/ggml/src/ggml-sycl/ggml-sycl.cpp#L3935-L3937 )

Yes. I've spent the last few days resolving conflicts and fixing test cases that started to fail after rebasing onto the recent changes. Everything is rebased and works for me locally now. I will test performance on my own again, and if all looks good I will ask you here to test perf again.

lslusarczyk · May 12 '25

Here's the performance I got compared to master, using FP32:

| model | device | test | t/s bb1681fb (5376, master) | t/s 53113d0a (5352, this PR) | speedup |
| --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | Data Max 1100 | pp512 | 5153.25 | 5176.24 | 1.0 |
| qwen2 1.5B Q4_0 | Data Max 1100 | tg128 | 146.65 | 146.3 | 1.0 |
| llama 7B Q4_0 | Data Max 1100 | pp512 | 1170.74 | 1170.12 | 1.0 |
| llama 7B Q4_0 | Data Max 1100 | tg128 | 71.48 | 72.16 | 1.01 |
| phi3 3B Q4_0 | Data Max 1100 | pp512 | 1909.14 | 1906.37 | 1.0 |
| phi3 3B Q4_0 | Data Max 1100 | tg128 | 109.66 | 111.03 | 1.01 |
| qwen2 1.5B Q4_0 | Arc B580 | pp512 | 3298.62 | 3327.74 | 1.01 |
| qwen2 1.5B Q4_0 | Arc B580 | tg128 | 79.35 | 81.92 | 1.03 |
| llama 7B Q4_0 | Arc B580 | pp512 | 781.32 | 781.11 | 1.0 |
| llama 7B Q4_0 | Arc B580 | tg128 | 61.99 | 63.41 | 1.02 |
| phi3 3B Q4_0 | Arc B580 | pp512 | 1259.04 | 1259.22 | 1.0 |
| phi3 3B Q4_0 | Arc B580 | tg128 | 80.08 | 79.38 | 0.99 |
| qwen2 1.5B Q4_0 | Arc A770 | pp512 | 3246.14 | 3268.02 | 1.01 |
| qwen2 1.5B Q4_0 | Arc A770 | tg128 | 44.78 | 44.77 | 1.0 |
| llama 7B Q4_0 | Arc A770 | pp512 | 790.7 | 789.74 | 1.0 |
| llama 7B Q4_0 | Arc A770 | tg128 | 33.87 | 33.79 | 1.0 |
| phi3 3B Q4_0 | Arc A770 | pp512 | 1257.26 | 1257.94 | 1.0 |
| phi3 3B Q4_0 | Arc A770 | tg128 | 39.4 | 39.53 | 1.0 |
| qwen2 1.5B Q4_0 | Arc V140 | pp512 | 886.79 | 884.95 | 1.0 |
| qwen2 1.5B Q4_0 | Arc V140 | tg128 | 47.65 | 47.91 | 1.01 |
| llama 7B Q4_0 | Arc V140 | pp512 | 207.99 | 208.04 | 1.0 |
| llama 7B Q4_0 | Arc V140 | tg128 | 20.91 | 20.87 | 1.0 |
| phi3 3B Q4_0 | Arc V140 | pp512 | 334.03 | 333.69 | 1.0 |
| phi3 3B Q4_0 | Arc V140 | tg128 | 31.26 | 31.11 | 1.0 |

Speedup is calculated as b5352 / b5376 (this PR / master).

I'm re-running with FP16 and some more models / quantizations, but so far it looks good.

Alcpz · May 14 '25

Results for FP16:

So far, results are roughly equivalent, and even better in some cases. A couple of cases may need a closer look, like qwen2 1.5B Q4_K - Medium on Battlemage during TG, and gemma2 2B Q4_K - Medium during prompt processing.

Benchmark Results
| model | device | test | t/s bb1681fb (5376, master) | t/s 53113d0a (5352, this PR) | speedup |
| --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | Data Max 1100 | pp512 | 6981.83 | 7050.21 | 1.01 |
| qwen2 1.5B Q4_0 | Data Max 1100 | tg128 | 143.85 | 144.1 | 1 |
| qwen2 1.5B Q4_K - Medium | Data Max 1100 | pp512 | 10254.4 | 10172.7 | 0.99 |
| qwen2 1.5B Q4_K - Medium | Data Max 1100 | tg128 | 107.35 | 107.43 | 1 |
| llama 7B Q4_0 | Data Max 1100 | pp512 | 1751.95 | 1751.15 | 1 |
| llama 7B Q4_0 | Data Max 1100 | tg128 | 71.06 | 71.28 | 1 |
| llama 7B Q4_K - Medium | Data Max 1100 | pp512 | 3369.32 | 3378.41 | 1 |
| llama 7B Q4_K - Medium | Data Max 1100 | tg128 | 38.98 | 38.99 | 1 |
| gemma2 2B Q4_K - Medium | Data Max 1100 | pp512 | 7825.8 | 7798.84 | 1 |
| gemma2 2B Q4_K - Medium | Data Max 1100 | tg128 | 82.58 | 82.4 | 1 |
| phi3 3B Q4_0 | Data Max 1100 | pp512 | 2818.08 | 2825.46 | 1 |
| phi3 3B Q4_0 | Data Max 1100 | tg128 | 108.64 | 109.39 | 1.01 |
| phi3 3B Q4_K - Medium | Data Max 1100 | pp512 | 4505.99 | 4626.3 | 1.03 |
| phi3 3B Q4_K - Medium | Data Max 1100 | tg128 | 65.41 | 65.43 | 1 |
| llama 34B Q6_K | Data Max 1100 | pp512 | 1755.48 | 1763.75 | 1 |
| llama 34B Q6_K | Data Max 1100 | tg128 | 23.53 | 23.3 | 0.99 |
| llama 8B Q4_K - Medium | Data Max 1100 | pp512 | 3383.7 | 3338.83 | 0.99 |
| llama 8B Q4_K - Medium | Data Max 1100 | tg128 | 36.26 | 35.93 | 0.99 |
| qwen2 1.5B Q4_0 | Arc B580 | pp512 | 5959.04 | 6052.54 | 1.02 |
| qwen2 1.5B Q4_0 | Arc B580 | tg128 | 138.28 | 138.52 | 1 |
| qwen2 1.5B Q4_K - Medium | Arc B580 | pp512 | 7564.61 | 7502.2 | 0.99 |
| qwen2 1.5B Q4_K - Medium | Arc B580 | tg128 | 106.54 | 100.88 | 0.95 |
| llama 7B Q4_0 | Arc B580 | pp512 | 1624.43 | 1624.43 | 1 |
| llama 7B Q4_0 | Arc B580 | tg128 | 65.43 | 65.09 | 0.99 |
| llama 7B Q4_K - Medium | Arc B580 | pp512 | 2206.23 | 2201.93 | 1 |
| llama 7B Q4_K - Medium | Arc B580 | tg128 | 38.3 | 38.2 | 1 |
| gemma2 2B Q4_K - Medium | Arc B580 | pp512 | 5752.78 | 5722.25 | 0.99 |
| gemma2 2B Q4_K - Medium | Arc B580 | tg128 | 75.26 | 72.35 | 0.96 |
| phi3 3B Q4_0 | Arc B580 | pp512 | 2410.01 | 2408.05 | 1 |
| phi3 3B Q4_0 | Arc B580 | tg128 | 100.07 | 95.97 | 0.96 |
| phi3 3B Q4_K - Medium | Arc B580 | pp512 | 3167.63 | 3165.56 | 1 |
| phi3 3B Q4_K - Medium | Arc B580 | tg128 | 55.43 | 54.65 | 0.99 |
| llama 34B Q6_K | Arc B580 | pp512 | 1486.23 | 1483.83 | 1 |
| llama 34B Q6_K | Arc B580 | tg128 | 20.43 | 20.27 | 0.99 |
| llama 8B Q4_K - Medium | Arc B580 | pp512 | 2075.02 | 2067.28 | 1 |
| llama 8B Q4_K - Medium | Arc B580 | tg128 | 34.68 | 34.06 | 0.98 |
| qwen2 1.5B Q4_0 | Arc A770 | pp512 | 4082.11 | 4071.16 | 1 |
| qwen2 1.5B Q4_0 | Arc A770 | tg128 | 44.44 | 44.5 | 1 |
| qwen2 1.5B Q4_K - Medium | Arc A770 | pp512 | 4460.65 | 4458.04 | 1 |
| qwen2 1.5B Q4_K - Medium | Arc A770 | tg128 | 43.85 | 44.37 | 1.01 |
| llama 7B Q4_0 | Arc A770 | pp512 | 1468.23 | 1468.99 | 1 |
| llama 7B Q4_0 | Arc A770 | tg128 | 33.55 | 33.42 | 1 |
| llama 7B Q4_K - Medium | Arc A770 | pp512 | 1727.55 | 1727.02 | 1 |
| llama 7B Q4_K - Medium | Arc A770 | tg128 | 25.85 | 25.72 | 0.99 |
| gemma2 2B Q4_K - Medium | Arc A770 | pp512 | 3724.92 | 3623.25 | 0.97 |
| gemma2 2B Q4_K - Medium | Arc A770 | tg128 | 35.11 | 35.06 | 1 |
| phi3 3B Q4_0 | Arc A770 | pp512 | 2163.5 | 2165.57 | 1 |
| phi3 3B Q4_0 | Arc A770 | tg128 | 37.23 | 37.48 | 1.01 |
| phi3 3B Q4_K - Medium | Arc A770 | pp512 | 2502.81 | 2499.9 | 1 |
| phi3 3B Q4_K - Medium | Arc A770 | tg128 | 29.86 | 29.91 | 1 |
| llama 34B Q6_K | Arc A770 | pp512 | 1063.01 | 1029.61 | 0.97 |
| llama 34B Q6_K | Arc A770 | tg128 | 15.42 | 14.95 | 0.97 |
| llama 8B Q4_K - Medium | Arc A770 | pp512 | 1676.83 | 1620.34 | 0.97 |
| llama 8B Q4_K - Medium | Arc A770 | tg128 | 23.19 | 22.64 | 0.98 |
| qwen2 1.5B Q4_0 | Arc V140 | pp512 | 1100.25 | 1086.64 | 0.99 |
| qwen2 1.5B Q4_0 | Arc V140 | tg128 | 47.9 | 47.55 | 0.99 |
| qwen2 1.5B Q4_K - Medium | Arc V140 | pp512 | 1490 | 1512.93 | 1.02 |
| qwen2 1.5B Q4_K - Medium | Arc V140 | tg128 | 37.24 | 37.26 | 1 |
| llama 7B Q4_0 | Arc V140 | pp512 | 323.82 | 323.29 | 1 |
| llama 7B Q4_0 | Arc V140 | tg128 | 20.49 | 20.94 | 1.02 |
| llama 7B Q4_K - Medium | Arc V140 | pp512 | 548.41 | 549.69 | 1 |
| llama 7B Q4_K - Medium | Arc V140 | tg128 | 13.35 | 13.37 | 1 |
| gemma2 2B Q4_K - Medium | Arc V140 | pp512 | 647.98 | 594.44 | 0.92 |
| gemma2 2B Q4_K - Medium | Arc V140 | tg128 | 25.2 | 25.2 | 1 |
| phi3 3B Q4_0 | Arc V140 | pp512 | 517.41 | 517.47 | 1 |
| phi3 3B Q4_0 | Arc V140 | tg128 | 31.41 | 31.41 | 1 |
| phi3 3B Q4_K - Medium | Arc V140 | pp512 | 841.13 | 843.35 | 1 |
| phi3 3B Q4_K - Medium | Arc V140 | tg128 | 21.19 | 21.21 | 1 |
| llama 34B Q6_K | Arc V140 | pp512 | 368.28 | 367.2 | 1 |
| llama 34B Q6_K | Arc V140 | tg128 | 7.34 | 7.33 | 1 |
| llama 8B Q4_K - Medium | Arc V140 | pp512 | 507.21 | 504.61 | 0.99 |
| llama 8B Q4_K - Medium | Arc V140 | tg128 | 11.77 | 11.73 | 1 |

Edit:

llama-bench command:
export GGML_SYCL_PRIORITIZE_DMMV=0
export GGML_SYCL_DISABLE_OPT=0
${LLAMA_BUILD_DIR}/bin/llama-bench \
  -m /home/shared/llama.cpp-models/DeepSeek-R1-Distill-Qwen-1.5B-Q4_0.gguf \
  -m /home/shared/llama.cpp-models/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/emma-500-llama2-7b-Q4_0.gguf \
  -m /home/shared/llama.cpp-models/emma-500-llama2-7b-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/gemma-2-2b-it-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/Phi-3.5-mini-instruct-Q4_0.gguf \
  -m /home/shared/llama.cpp-models/Phi-3.5-mini-instruct-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/solar-10.7b-instruct-v1.0.Q6_K.gguf \
  -m /home/shared/llama.cpp-models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -p 512 \
  -n 128 \
  -pg 0,0 \
  -mmp 0 \
  -t 8 \
  -r 5 \
  -sm none \
  -ngl 99 \
  -o md

Alcpz · May 14 '25

All tests have passed. @Alcpz, @Rbiessy, could you please merge?

lslusarczyk · May 15 '25