Excessively slow prompt processing time with 70B partially offloaded in SYCL
Prompt processing is extremely slow with a 70B model partially offloaded.
llama-bench.exe -ngl 20 -m "D:\models\lzlv_70b_fp16_hf.Q4_K_M.gguf"
Using device 0 (Intel(R) Arc(TM) A770 Graphics) as main device
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | SYCL | 20 | pp 512 | 2.14 ± 0.28 |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | SYCL | 20 | tg 128 | 1.03 ± 0.01 |
build: a28c5eff (2045)
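For anyone trying to reproduce this, a minimal build-and-run sketch follows. It assumes a Linux shell with the Intel oneAPI toolkit installed (the reporter is on Windows, where oneAPI's setvars.bat plays the same role), and the model path is a placeholder; the LLAMA_SYCL flag and icx/icpx compilers match the SYCL build instructions from around this build, but newer trees renamed the option.

```
# Hedged sketch: build llama.cpp with the SYCL backend, then run the same bench.
# Assumes Intel oneAPI is installed; newer trees renamed LLAMA_SYCL to GGML_SYCL.
source /opt/intel/oneapi/setvars.sh
cmake -B build -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j

# Partially offload 20 of the model's layers to the GPU, as in the report above.
./build/bin/llama-bench -ngl 20 -m /path/to/lzlv_70b_fp16_hf.Q4_K_M.gguf
```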
Hi @Jacoby1218, could you provide some reference data to show the magnitude of the gap? For example, performance on an RTX 4070 Ti (16 GB), or entirely on the iGPU/CPU?
I don't have any other GPU to test with, but I can provide results from my CPU and other backends.
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | BLAS | 6 | pp 512 | 1.93 ± 0.06 |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | BLAS | 6 | tg 128 | 0.81 ± 0.02 |
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | Vulkan | 20 | pp 512 | 7.02 ± 0.25 |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | Vulkan | 20 | tg 128 | 0.97 ± 0.04 |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | OpenCL | 20 | pp 512 | 8.81 ± 1.10 |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | OpenCL | 20 | tg 128 | 0.82 ± 0.02 |
I think this may be due to missing optimization of multi-batch processing; it has been recorded in https://github.com/ggerganov/llama.cpp/discussions/5277. Please stay tuned!
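One way to probe the multi-batch path directly (a hedged sketch, not something posted in the thread) is to sweep llama-bench's batch size: -b accepts a comma-separated list, -p 512 runs the prompt-processing test, and -n 0 skips text generation, so a single run compares pp 512 throughput across batch sizes. If small batches track the CPU numbers while large batches stay slow, the batched kernels are the likely bottleneck.

```
# Hedged sketch: compare prompt-processing speed at several batch sizes.
llama-bench.exe -ngl 20 -p 512 -n 0 -b 1,64,512 -m "D:\models\lzlv_70b_fp16_hf.Q4_K_M.gguf"
```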
This issue is stale because it has been open for 30 days with no activity.
I think this has been improved with https://github.com/ggerganov/llama.cpp/pull/6217; please give it a try.
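A hedged sketch of how one might verify the improvement: update to a build that contains that PR, rebuild the SYCL backend, and re-run the original benchmark so the new pp 512 figure can be compared against the 2.14 t/s above (model path is a placeholder).

```
# Hedged sketch: rebuild at or after the PR and repeat the original test.
git pull
cmake -B build -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j
./build/bin/llama-bench -ngl 20 -m /path/to/lzlv_70b_fp16_hf.Q4_K_M.gguf
```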
This issue was closed because it has been inactive for 14 days since being marked as stale.