Excessively slow prompt processing time with 70B partially offloaded in SYCL
Prompt processing is extremely slow with a 70B model partially offloaded.
llama-bench.exe -ngl 20 -m "D:\models\lzlv_70b_fp16_hf.Q4_K_M.gguf"
Using device 0 (Intel(R) Arc(TM) A770 Graphics) as main device
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | SYCL | 20 | pp 512 | 2.14 ± 0.28 |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | SYCL | 20 | tg 128 | 1.03 ± 0.01 |
build: a28c5eff (2045)
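For anyone trying to reproduce this, a minimal build-and-run sketch follows. It assumes a Linux shell with the Intel oneAPI toolkit installed (the reporter is on Windows, where oneAPI's setvars.bat plays the same role), and the model path is a placeholder; the LLAMA_SYCL flag and icx/icpx compilers match the SYCL build instructions from around this build, but newer trees renamed the option.

```
# Hedged sketch: build llama.cpp with the SYCL backend, then run the same bench.
# Assumes Intel oneAPI is installed; newer trees renamed LLAMA_SYCL to GGML_SYCL.
source /opt/intel/oneapi/setvars.sh
cmake -B build -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j

# Partially offload 20 of the model's layers to the GPU, as in the report above.
./build/bin/llama-bench -ngl 20 -m /path/to/lzlv_70b_fp16_hf.Q4_K_M.gguf
```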
Hi @Jacoby1218, could you provide some reference data to show the magnitude of the gap? For example, performance on an RTX 4070 Ti (16 GB), or entirely on the iGPU/CPU?
I don't have any other GPU to test with, but I can provide results from my CPU and other backends.
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | BLAS | 6 | pp 512 | 1.93 ± 0.06 |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | BLAS | 6 | tg 128 | 0.81 ± 0.02 |
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | Vulkan | 20 | pp 512 | 7.02 ± 0.25 |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | Vulkan | 20 | tg 128 | 0.97 ± 0.04 |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | OpenCL | 20 | pp 512 | 8.81 ± 1.10 |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | OpenCL | 20 | tg 128 | 0.82 ± 0.02 |
I think this may be due to missing optimization of multi-batch processing; it has been recorded in https://github.com/ggerganov/llama.cpp/discussions/5277. Please stay tuned!
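One way to probe the multi-batch path directly (a hedged sketch, not something posted in the thread) is to sweep llama-bench's batch size: -b accepts a comma-separated list, -p 512 runs the prompt-processing test, and -n 0 skips text generation, so a single run compares pp 512 throughput across batch sizes. If small batches track the CPU numbers while large batches stay slow, the batched kernels are the likely bottleneck.

```
# Hedged sketch: compare prompt-processing speed at several batch sizes.
llama-bench.exe -ngl 20 -p 512 -n 0 -b 1,64,512 -m "D:\models\lzlv_70b_fp16_hf.Q4_K_M.gguf"
```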
This issue is stale because it has been open for 30 days with no activity.
I think this has been improved with https://github.com/ggerganov/llama.cpp/pull/6217; please give it a try.
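A hedged sketch of how one might verify the improvement: update to a build that contains that PR, rebuild the SYCL backend, and re-run the original benchmark so the new pp 512 figure can be compared against the 2.14 t/s above (model path is a placeholder).

```
# Hedged sketch: rebuild at or after the PR and repeat the original test.
git pull
cmake -B build -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j
./build/bin/llama-bench -ngl 20 -m /path/to/lzlv_70b_fp16_hf.Q4_K_M.gguf
```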
This issue was closed because it has been inactive for 14 days since being marked as stale.