
CLBlast support

Open 0cc4m opened this issue 1 year ago • 13 comments

Add CLBlast support as an alternative to CuBLAS to speed up context processing.

The advantage of CLBlast over CuBLAS is that it is vendor-agnostic, it runs on basically any GPU (even some phones). It is also a much smaller library than proprietary CuBLAS, while managing to be nearly as fast.

Resolves #1059

0cc4m avatar Apr 24 '23 20:04 0cc4m

This patch works fine for me on my Intel HD530 iGPU. CLBlast is slower than the CPU, though, with prompt ingestion speeds of ~330 ms/token vs ~150 ms/token with OpenBLAS.

ghost avatar Apr 24 '23 23:04 ghost

Comparison between latest master with OpenBLAS versus this PR with CLBlast, processing dan.txt:

OpenBLAS on Ryzen 2600: llama_print_timings: prompt eval time = 35540.49 ms / 399 tokens ( 89.07 ms per token)
CLBlast on RX 570: llama_print_timings: prompt eval time = 20087.81 ms / 399 tokens ( 50.35 ms per token)

rabidcopy avatar Apr 25 '23 00:04 rabidcopy

In case anyone is concerned: 0cc4m is the main developer of the code relating to the CLBlast kernels and implementation, and we are fine with this code being merged upstream under the MIT license. So there will not be any licensing incompatibilities with KoboldCpp.

LostRuins avatar Apr 25 '23 12:04 LostRuins

I have some thoughts.

I think the header ggml-opencl.h should not have all that implementation-specific stuff in it. It should be moved to ggml-opencl.cpp; only the two function declarations that ggml.c uses should stay.

something like this: https://github.com/SlyEcho/llama.cpp/commit/9ff5ce85a28eab8cc82d959d9767181d931c2480
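In other words, the header would shrink to roughly this (names and signatures here are only illustrative, not necessarily what the diff actually contains):

```cpp
// ggml-opencl.h -- public interface only; everything OpenCL-specific lives in ggml-opencl.cpp
#pragma once

#ifdef __cplusplus
extern "C" {
#endif

// One-time setup: pick a platform/device, create the context and command queue.
void ggml_cl_init(void);

// Drop-in SGEMM used by ggml.c's BLAS mul_mat path; handles the host/device
// copies internally so ggml.c never sees any OpenCL types.
void ggml_cl_sgemm_wrapper(int order, int trans_a, int trans_b,
                           int m, int n, int k,
                           float alpha,
                           const void  * host_a, int lda,
                           const float * host_b, int ldb,
                           float beta,
                           float * host_c, int ldc,
                           int btype);

#ifdef __cplusplus
}
#endif
```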

SlyEcho avatar Apr 25 '23 20:04 SlyEcho

Thank you, I added most of those suggestions. You also found some leftover code from previous implementations that I hadn't caught.

0cc4m avatar Apr 26 '23 05:04 0cc4m

Please add CUBLAS and CLBLAS to these lines: https://github.com/ggerganov/llama.cpp/blob/859fee6dfb00fab7ce6bc215b4adae78d82f4759/llama.cpp#L2392

jon-chuang avatar Apr 26 '23 19:04 jon-chuang

Timings on my iGPU (AMD Accelerated Parallel Processing Device: gfx1035):

llama_print_timings: prompt eval time = 22031.28 ms /   290 tokens (   75.97 ms per token)

CPU (8 physical cores):

llama_print_timings: prompt eval time = 30864.94 ms /   290 tokens (  106.43 ms per token)

jon-chuang avatar Apr 26 '23 19:04 jon-chuang

Could you show a profile of the launch, HtD (host-to-device) copy, kernel (quantize, sgemm), and DtH (device-to-host) copy times for a reasonable matmul size (e.g. batch 512, an 8192 × 8192 matrix)?

I could investigate this separately, but there is an idea to improve the device-to-host copy by splitting the sgemm into smaller batches and overlapping the DtH copy with the sgemm.
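Roughly what I mean, as a sketch only (the CLBlastSgemm call is the plain CLBlast C API; real code would need error handling, and actually getting overlap needs an out-of-order queue or a second queue):

```cpp
#include <clblast_c.h>
#include <CL/cl.h>

// Compute C = A * B (column-major, m x k times k x n) in column chunks so the
// device-to-host copy of one chunk of C can overlap with the SGEMM of the next.
// d_A and d_B are assumed to already be resident on the device.
static void sgemm_chunked(cl_command_queue queue,
                          cl_mem d_A, cl_mem d_B, cl_mem d_C,
                          size_t m, size_t n, size_t k,
                          float * host_C, size_t chunk_cols) {
    for (size_t j0 = 0; j0 < n; j0 += chunk_cols) {
        const size_t nc = (j0 + chunk_cols > n) ? (n - j0) : chunk_cols;

        cl_event sgemm_done = NULL;
        CLBlastSgemm(CLBlastLayoutColMajor, CLBlastTransposeNo, CLBlastTransposeNo,
                     m, nc, k,
                     1.0f,
                     d_A, 0,      m,   // A: m x k, lda = m
                     d_B, j0 * k, k,   // B columns j0 .. j0+nc, ldb = k
                     0.0f,
                     d_C, j0 * m, m,   // C columns j0 .. j0+nc, ldc = m
                     &queue, &sgemm_done);

        // Non-blocking read of the finished chunk; the next SGEMM can be
        // enqueued (and, with a second/out-of-order queue, actually run)
        // while this transfer is in flight.
        clEnqueueReadBuffer(queue, d_C, CL_FALSE,
                            j0 * m * sizeof(float), nc * m * sizeof(float),
                            host_C + j0 * m,
                            1, &sgemm_done, NULL);
        clReleaseEvent(sgemm_done);
    }
    clFinish(queue); // make sure all chunks of C have landed in host memory
}
```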

jon-chuang avatar Apr 26 '23 19:04 jon-chuang

@jon-chuang Here is a trace from my Steam Deck using CLBlast I took earlier. You can open it in ui.perfetto.dev or chrome://tracing clblast hsa trace.json.gz

I don't know how heavy you want it, but this was dan.txt on 7B Q4_0.

SlyEcho avatar Apr 26 '23 19:04 SlyEcho

[trace screenshot]

It seems the weights copy takes a while (and is also rather fragmented). It would be nice to see device-side weight caching as a next step.
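Something like the following, as a hypothetical sketch (a per-tensor cache keyed by the host pointer of the weight data; VRAM budget and eviction are ignored, and this is not what the PR currently does):

```cpp
#include <CL/cl.h>
#include <unordered_map>

// Hypothetical cache: upload a weight tensor once and reuse the device buffer
// on every later mul_mat that touches the same host data.
static std::unordered_map<const void *, cl_mem> g_weight_cache;

static cl_mem get_cached_weights(cl_context ctx, cl_command_queue queue,
                                 const void * host_data, size_t size) {
    auto it = g_weight_cache.find(host_data);
    if (it != g_weight_cache.end()) {
        return it->second; // already resident on the device, skip the HtD copy
    }
    cl_int err = CL_SUCCESS;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, size, NULL, &err);
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, size, host_data, 0, NULL, NULL);
    g_weight_cache[host_data] = buf;
    return buf;
}
```

The obvious caveat is device memory: for larger models the cache would need some budget and an eviction policy.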

jon-chuang avatar Apr 26 '23 20:04 jon-chuang

For these iGPUs I wonder if OpenCL zero-copy is an option to reduce the impact of copying data in and out of memory. I don't think CLBlast supports this feature directly, though.
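For reference, the usual zero-copy pattern looks roughly like this; CLBlast itself just takes cl_mem buffers, so the caller would have to create them this way (sketch only):

```cpp
#include <CL/cl.h>
#include <cstring>

// On an iGPU, CL_MEM_ALLOC_HOST_PTR (or CL_MEM_USE_HOST_PTR with a suitably
// aligned allocation) lets the driver place the buffer in system memory that
// both CPU and GPU can access, so "uploads" become map/unmap instead of copies.
static cl_mem make_zero_copy_buffer(cl_context ctx, cl_command_queue queue,
                                    size_t size, const void * src) {
    cl_int err = CL_SUCCESS;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                size, NULL, &err);

    // Fill the buffer through a mapping instead of clEnqueueWriteBuffer.
    void * ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                    0, size, 0, NULL, NULL, &err);
    memcpy(ptr, src, size);
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
    return buf;
}
```

Ideally the model loader would write the tensor data straight into the mapped region, so even that one memcpy goes away.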

ghost avatar Apr 26 '23 21:04 ghost

Yesterday I performed the CLBlast tuning for the Steam Deck. I can check if there is a difference; it takes a few hours to do.

SlyEcho avatar Apr 27 '23 07:04 SlyEcho

I'll have to rebase onto a newer version soon and implement the dequantization functions that have been added in the meantime. Should I do that or leave the PR as-is and add dequant kernels in a future PR?

0cc4m avatar Apr 27 '23 13:04 0cc4m

Please add CUBLAS and CLBLAS to these lines:

https://github.com/ggerganov/llama.cpp/blob/859fee6dfb00fab7ce6bc215b4adae78d82f4759/llama.cpp#L2392

I think that output will get pretty crowded if we just add everything to it. Considering we are just adding a bunch of BLAS backends, I think it's fine if it just shows that BLAS is enabled, not which specific backend.
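i.e. on the ggml side it could stay as simple as something like this (the exact macro names depend on what the build defines, so treat this as a sketch):

```cpp
// Reported once in llama.cpp's system info string; any BLAS-style backend
// just flips the same flag instead of getting its own entry.
int ggml_cpu_has_blas(void) {
#if defined(GGML_USE_ACCELERATE) || defined(GGML_USE_OPENBLAS) || \
    defined(GGML_USE_CUBLAS)     || defined(GGML_USE_CLBLAST)
    return 1;
#else
    return 0;
#endif
}
```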

0cc4m avatar Apr 27 '23 20:04 0cc4m

@ggerganov @slaren Anything else that's required here? I think we have reached a good state.

0cc4m avatar Apr 28 '23 07:04 0cc4m

What I was thinking is that all the different BLAS backends could be abstracted away from ggml.c, so there would only be generic calls like ggml_blas_alloc_mem(), ggml_blas_memcpy_host_device(), ggml_blas_dequantize() and ggml_blas_sgemm(). This would work for OpenBLAS too, because the allocation and memory copy would be no-ops.
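As a rough sketch of what that interface could look like (the four names are from above; the signatures and the extra helpers are just guesses at what would be needed):

```cpp
// Hypothetical backend-neutral interface that ggml.c would call; each BLAS
// backend (OpenBLAS, cuBLAS, CLBlast) provides its own implementation behind it.
typedef struct ggml_blas_buffer ggml_blas_buffer; // opaque device/host handle

ggml_blas_buffer * ggml_blas_alloc_mem(size_t size);
void               ggml_blas_free_mem (ggml_blas_buffer * buf);

// A no-op (just hands back the host pointer) for OpenBLAS, a real transfer for GPU backends.
void ggml_blas_memcpy_host_device(ggml_blas_buffer * dst, const void * src, size_t size);
void ggml_blas_memcpy_device_host(void * dst, const ggml_blas_buffer * src, size_t size);

// Dequantize a quantized tensor into an f32 buffer, on the device if possible.
void ggml_blas_dequantize(ggml_blas_buffer * dst, const ggml_blas_buffer * src,
                          int type, size_t n_elements);

// Plain column-major SGEMM on buffers obtained from ggml_blas_alloc_mem().
void ggml_blas_sgemm(int trans_a, int trans_b,
                     int m, int n, int k,
                     float alpha, const ggml_blas_buffer * a, int lda,
                                  const ggml_blas_buffer * b, int ldb,
                     float beta,        ggml_blas_buffer * c, int ldc);
```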

That being said, I think it's better if this PR were merged first.

SlyEcho avatar Apr 28 '23 08:04 SlyEcho

@0cc4m Hi, I came across this conversation and I have a question: if I use an iGPU, is there still unnecessary copying of data between RAM and the RAM dedicated to the iGPU?

Folko-Ven avatar Apr 28 '23 11:04 Folko-Ven

@Folko-Ven Sadly that is not the case. I tried implementing that to test it, using Intel's recommendations, but found that it slowed Nvidia down, led to OOM errors on Intel and was straight up not implemented for AMD. I am not sure if I did something wrong or if it is simply not well-supported on OpenCL. If you are interested in specifics of what I tried, you can look at the clblast-llama-cpp-igpu branch on my fork.

0cc4m avatar Apr 28 '23 12:04 0cc4m

Too bad. I'm not so much worried about the extra performance as about the extra memory used. Looks like I'll have to look for a laptop with a dGPU. And I want to thank you again for this CLBlast implementation.

Folko-Ven avatar Apr 28 '23 13:04 Folko-Ven

Wanted to add: it appears OpenCL performance on AMD is actually better with the opencl-mesa package than with the opencl-amd package on Arch. llama_print_timings: prompt eval time = 15324.17 ms / 399 tokens ( 38.41 ms per token) (roughly 10 ms per token faster than opencl-amd)

rabidcopy avatar Apr 28 '23 15:04 rabidcopy

@rabidcopy Interesting result. I thought the Mesa OpenCL driver wasn't really functional. Do you know which hardware is supported? Or did you use the new rusticl already?

0cc4m avatar Apr 28 '23 15:04 0cc4m

@rabidcopy Interesting result. I thought the Mesa OpenCL driver wasn't really functional. Do you know which hardware is supported? Or did you use the new rusticl already?

No idea, honestly. I'm using an RX 570, which is not ancient but not new either. Platform: Clover, Device: AMD Radeon RX 570 Series (polaris10, LLVM 15.0.7, DRM 3.49, 6.2.7-zen1-1-zen)

rabidcopy avatar Apr 28 '23 15:04 rabidcopy

Has anyone compared speeds between Clover and rusticl OpenCL? Apparently rusticl is getting merged into Mesa soon. Kinda curious whether it would be worth going through the trouble of building Mesa from source, or better to just wait.

rabidcopy avatar Apr 28 '23 22:04 rabidcopy

@rabidcopy I tried, but Clover doesn't support my RX 6800 XT. I'll try to get rusticl to work and compare it with AMD's pro driver.

0cc4m avatar Apr 29 '23 05:04 0cc4m

I got it to work, but rusticl was approximately 2x slower than the rocm-opencl-runtime for me.

0cc4m avatar Apr 29 '23 07:04 0cc4m

Huh, very strange. For me I can't even use rocm-opencl-runtime as my card is too old.

rabidcopy avatar Apr 29 '23 08:04 rabidcopy

@0cc4m Are there plans to add multi-GPU support like in the CUDA refactor? https://github.com/ggerganov/llama.cpp/pull/1607/commits

Eliastrt avatar Jun 06 '23 16:06 Eliastrt