llama.cpp
CLBlast support
Add CLBlast support as an alternative to CuBLAS to speed up context processing.
The advantage of CLBlast over CuBLAS is that it is vendor-agnostic: it runs on basically any GPU (even some phones). It is also a much smaller library than the proprietary CuBLAS, while managing to be nearly as fast.
Resolves #1059
This patch works fine for me on my Intel HD530 iGPU, but CLBlast is slower than the CPU there, with prompt ingestion speeds of ~330 ms/token vs ~150 ms/token with OpenBLAS.
Comparison of the latest master with OpenBLAS versus this PR with CLBlast, processing dan.txt.
OpenBLAS on Ryzen 2600:
llama_print_timings: prompt eval time = 35540.49 ms / 399 tokens ( 89.07 ms per token)
CLBlast on RX 570:
llama_print_timings: prompt eval time = 20087.81 ms / 399 tokens ( 50.35 ms per token)
In case anyone is concerned: 0cc4m is the main developer of the code relating to the CLBlast kernels and implementation, and we are fine with this code being merged upstream under the MIT license, so there will not be any licensing incompatibilities with KoboldCpp.
I have some thoughts.
I think the header ggml-opencl.h should not have all that implementation-specific stuff in it. It should be moved to ggml-opencl.cpp; only the two function declarations that ggml.c uses should stay.
Something like this: https://github.com/SlyEcho/llama.cpp/commit/9ff5ce85a28eab8cc82d959d9767181d931c2480
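In outline, the slimmed-down header would look something like this (the signatures here are simplified for illustration, not the exact ones from the commit above):

```c
// ggml-opencl.h: sketch, keeping only the declarations that ggml.c needs.
#pragma once

#ifdef __cplusplus
extern "C" {
#endif

// One-time setup: pick a platform/device, create the context, queue and kernels.
void ggml_cl_init(void);

// Dequantize A if needed, run the SGEMM on the OpenCL device and read the
// result back into host_c. (Parameter list simplified for illustration.)
void ggml_cl_sgemm_wrapper(
    int order, int trans_a, int trans_b,
    int m, int n, int k,
    float alpha, const void * host_a, int lda,
    const float * host_b, int ldb,
    float beta, float * host_c, int ldc,
    int btype); // btype: ggml type of A (f32 or a quantized type)

#ifdef __cplusplus
}
#endif
```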
Thank you, I added most of those suggestions. You also found some leftover code from previous implementations that I hadn't caught.
Please add CUBLAS and CLBLAS to these lines: https://github.com/ggerganov/llama.cpp/blob/859fee6dfb00fab7ce6bc215b4adae78d82f4759/llama.cpp#L2392
Timings on my iGPU (AMD Accelerated Parallel Processing Device: gfx1035):
llama_print_timings: prompt eval time = 22031.28 ms / 290 tokens ( 75.97 ms per token)
CPU (8 physical cores):
llama_print_timings: prompt eval time = 30864.94 ms / 290 tokens ( 106.43 ms per token)
Could you show a profile of the launch, host-to-device (HTD), kernel (quantize, sgemm), and device-to-host (DTH) times for a reasonable matmul size (e.g. batch 512, matrix of 8192 x 8192)?
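For reference, OpenCL event profiling would be enough to collect those numbers without an external tracer: create the queue with CL_QUEUE_PROFILING_ENABLE and read the start/end timestamps of the copy and kernel events. A minimal sketch:

```c
#include <CL/cl.h>
#include <stdio.h>

// Returns the duration of a profiled OpenCL event in milliseconds.
// The queue must have been created with CL_QUEUE_PROFILING_ENABLE.
static double event_ms(cl_event ev) {
    cl_ulong t0 = 0, t1 = 0;
    clWaitForEvents(1, &ev);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);
    return (double)(t1 - t0) * 1e-6; // timestamps are in nanoseconds
}

// Usage sketch: collect one event per phase and print a small profile.
//   cl_event ev_htd, ev_quant, ev_sgemm, ev_dth;
//   clEnqueueWriteBuffer(queue, d_a, CL_FALSE, 0, size_a, host_a, 0, NULL, &ev_htd);
//   ... enqueue the dequantize kernel with &ev_quant and the SGEMM with &ev_sgemm ...
//   clEnqueueReadBuffer(queue, d_c, CL_FALSE, 0, size_c, host_c, 0, NULL, &ev_dth);
//   printf("HTD %.2f ms | quant %.2f ms | sgemm %.2f ms | DTH %.2f ms\n",
//          event_ms(ev_htd), event_ms(ev_quant), event_ms(ev_sgemm), event_ms(ev_dth));
```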
I could investigate this separately, but there is an idea to improve the device-to-host copy by splitting the sgemm into smaller batches and overlapping the DTH copy with the sgemm, as sketched below.
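Roughly this; enqueue_sgemm_chunk() is a made-up helper that would wrap the CLBlast SGEMM call for the given column range, and I'm assuming a column-major C so each chunk is contiguous:

```c
#include <CL/cl.h>
#include <stddef.h>

// Hypothetical helper: enqueues an SGEMM (e.g. via CLBlast) that writes columns
// [col0, col0 + ncols) of the m x n result matrix d_c and returns its event.
cl_event enqueue_sgemm_chunk(cl_command_queue q, cl_mem d_a, cl_mem d_b, cl_mem d_c,
                             int m, int n, int k, int col0, int ncols);

// Sketch: overlap the device-to-host copy of chunk i-1 with the SGEMM of chunk i.
// Assumes a column-major C with leading dimension m, so each column chunk is a
// contiguous range of ncols*m floats in the buffer.
static void sgemm_overlapped(cl_command_queue q_compute, cl_command_queue q_copy,
                             cl_mem d_a, cl_mem d_b, cl_mem d_c,
                             int m, int n, int k, int chunk, float * host_c) {
    cl_event prev = NULL;
    int prev_col0 = 0, prev_ncols = 0;

    for (int col0 = 0; col0 < n; col0 += chunk) {
        const int ncols = (col0 + chunk <= n) ? chunk : (n - col0);
        cl_event ev = enqueue_sgemm_chunk(q_compute, d_a, d_b, d_c, m, n, k, col0, ncols);

        if (prev != NULL) {
            // Read back the previous chunk while the current SGEMM is running.
            clEnqueueReadBuffer(q_copy, d_c, CL_FALSE,
                                (size_t) prev_col0 * m * sizeof(float),
                                (size_t) prev_ncols * m * sizeof(float),
                                host_c + (size_t) prev_col0 * m,
                                1, &prev, NULL);
            clReleaseEvent(prev);
        }
        prev = ev; prev_col0 = col0; prev_ncols = ncols;
    }

    // Read back the last chunk; the blocking read also flushes the in-order copy queue.
    clEnqueueReadBuffer(q_copy, d_c, CL_TRUE,
                        (size_t) prev_col0 * m * sizeof(float),
                        (size_t) prev_ncols * m * sizeof(float),
                        host_c + (size_t) prev_col0 * m,
                        1, &prev, NULL);
    clReleaseEvent(prev);
}
```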
@jon-chuang Here is a trace from my Steam Deck using CLBlast that I took earlier. You can open it in ui.perfetto.dev or chrome://tracing: clblast hsa trace.json.gz
I don't know how heavy a workload you want, but this was dan.txt on 7B Q4_0.
Seems the weights copy takes a while (and is also rather fragmented). Would be nice to see device-side weight caching as a next step.
For these iGPUs I wonder if OpenCL zero-copy is an option to reduce the impact of copying data in and out of memory. I don't think CLBlast supports this feature directly, though.
Yesterday I performed the CLBlast tuning for the Steam Deck. I can check if there is a difference; it takes a few hours to do.
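Roughly what I mean by caching, as a sketch (not what this PR does; names are made up): upload each weight tensor once and key the resulting cl_mem on its host pointer, so repeated matmuls over the same layer skip the host-to-device copy.

```c
#include <CL/cl.h>
#include <stddef.h>

// Very small device-side weight cache. Sketch only: fixed capacity, no eviction.
#define WEIGHT_CACHE_SIZE 1024

struct weight_cache_entry {
    const void * host_ptr; // identity of the weight tensor data on the host
    cl_mem       dev_buf;
    size_t       size;
};

static struct weight_cache_entry g_weight_cache[WEIGHT_CACHE_SIZE];
static int g_weight_cache_count = 0;

// Returns a device buffer holding `size` bytes from `host_ptr`, uploading it
// only the first time this pointer is seen.
static cl_mem get_cached_weights(cl_context ctx, cl_command_queue queue,
                                 const void * host_ptr, size_t size) {
    for (int i = 0; i < g_weight_cache_count; i++) {
        if (g_weight_cache[i].host_ptr == host_ptr && g_weight_cache[i].size == size) {
            return g_weight_cache[i].dev_buf; // cache hit: no HTD copy
        }
    }

    cl_int err = CL_SUCCESS;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, size, NULL, &err);
    if (err != CL_SUCCESS) {
        return NULL;
    }
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, size, host_ptr, 0, NULL, NULL);

    if (g_weight_cache_count < WEIGHT_CACHE_SIZE) {
        g_weight_cache[g_weight_cache_count].host_ptr = host_ptr;
        g_weight_cache[g_weight_cache_count].dev_buf  = buf;
        g_weight_cache[g_weight_cache_count].size     = size;
        g_weight_cache_count++;
    }
    return buf;
}
```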
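For reference, the usual OpenCL zero-copy route is to wrap the host allocation with CL_MEM_USE_HOST_PTR (or allocate with CL_MEM_ALLOC_HOST_PTR and map it), so an iGPU driver can read the host memory directly. CLBlast only sees a cl_mem, so the caller could hand it such a buffer, but whether the copy is actually elided depends on the driver and on alignment. A minimal sketch:

```c
#include <CL/cl.h>
#include <stddef.h>

// Zero-copy attempt for iGPUs: wrap an existing, suitably aligned host
// allocation in a cl_mem with CL_MEM_USE_HOST_PTR. On integrated GPUs the
// driver may use the host memory directly instead of copying it; discrete
// GPUs will generally still copy behind the scenes.
//
// Intel's guidance (as I recall it) is that host_ptr should be 4096-byte
// aligned and the size a multiple of 64 bytes for the copy to be elided.
static cl_mem wrap_host_buffer(cl_context ctx, void * host_ptr, size_t size, cl_int * err) {
    return clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, size, host_ptr, err);
}
```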
I'll have to rebase onto a newer version soon and implement the dequantization functions that have been added in the meantime. Should I do that, or leave the PR as-is and add the dequant kernels in a future PR?
Please add CUBLAS and CLBLAS to these lines:
https://github.com/ggerganov/llama.cpp/blob/859fee6dfb00fab7ce6bc215b4adae78d82f4759/llama.cpp#L2392
I think that output will get pretty crowded if we just add everything to it. Considering we are just adding a bunch of BLAS backends, I think it's fine if it just shows that BLAS is enabled, not which specific backend.
@ggerganov @slaren Anything else that's required here? I think we have reached a good state.
What could be done: I was thinking that all the different BLAS backends could be abstracted away from ggml.c, so that there would only be generic calls like
ggml_blas_alloc_mem()
ggml_blas_memcpy_host_device()
ggml_blas_dequantize()
ggml_blas_sgemm()
and this would work for OpenBLAS too, because the allocation and memory copy would be no-ops. A rough sketch of what such an interface could look like is below.
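This is only a strawman built around the call names above (the signatures are made up); each backend (OpenBLAS, cuBLAS, CLBlast) would provide its own implementation:

```c
// ggml-blas.h (strawman): a backend-neutral interface that ggml.c could call.
// OpenBLAS would implement alloc/memcpy as no-ops over host pointers;
// cuBLAS/CLBlast would allocate device buffers and copy.
#pragma once
#include <stddef.h>

#ifdef __cplusplus
extern "C" {
#endif

typedef void * ggml_blas_buf; // host pointer or device handle, backend-defined

ggml_blas_buf ggml_blas_alloc_mem(size_t size);
void          ggml_blas_free_mem(ggml_blas_buf buf);

void ggml_blas_memcpy_host_device(ggml_blas_buf dst, const void * src, size_t size);
void ggml_blas_memcpy_device_host(void * dst, ggml_blas_buf src, size_t size);

// Dequantize a buffer of n elements of ggml type `type` into f32.
void ggml_blas_dequantize(ggml_blas_buf dst_f32, ggml_blas_buf src_q, int type, size_t n);

// C = alpha*A*B + beta*C, with A (m x k), B (k x n), C (m x n).
void ggml_blas_sgemm(int trans_a, int trans_b, int m, int n, int k,
                     float alpha, ggml_blas_buf a, int lda,
                     ggml_blas_buf b, int ldb,
                     float beta, ggml_blas_buf c, int ldc);

#ifdef __cplusplus
}
#endif
```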
That being said, I think it's better if this PR were merged first.
@0cc4m Hi, I came across this conversation and I have a question: if I use an iGPU, is there no redundant copying of data between system RAM and the RAM dedicated to the iGPU?
@Folko-Ven Sadly that is not the case. I tried implementing that to test it, using Intel's recommendations, but found that it slowed Nvidia down, led to OOM errors on Intel and was straight up not implemented for AMD. I am not sure if I did something wrong or if it is simply not well-supported on OpenCL. If you are interested in specifics of what I tried, you can look at the clblast-llama-cpp-igpu branch on my fork.
Too bad. I'm not so much worried about the lost performance as about the extra memory used. Looks like I'll have to look for a laptop with a dGPU. And I want to thank you again for this CLBlast implementation.
Wanted to add: it appears OpenCL performance on AMD is actually better with the opencl-mesa package than with the opencl-amd package on Arch.
llama_print_timings: prompt eval time = 15324.17 ms / 399 tokens ( 38.41 ms per token)
(Roughly 10 ms per token faster than opencl-amd)
@rabidcopy Interesting result. I thought the Mesa OpenCL driver wasn't really functional. Do you know which hardware is supported? Or did you use the new rusticl already?
No idea, honestly. I'm using an RX 570, which is not ancient but not new either.
Platform: Clover, Device: AMD Radeon RX 570 Series (polaris10, LLVM 15.0.7, DRM 3.49, 6.2.7-zen1-1-zen)
Has anyone compared speeds between Clover and rusticl OpenCL? Apparently rusticl is getting merged into Mesa soon. Kinda curious if it would be worth going through the trouble of building Mesa from source or just waiting.
@rabidcopy I tried, but Clover doesn't support my RX 6800 XT. I'll try to get rusticl to work and compare it with AMD's pro driver.
I got it to work, but rusticl was approximately 2x slower than the rocm-opencl-runtime for me.
Huh, very strange. I can't even use rocm-opencl-runtime, as my card is too old.
@0cc4m Are there plans to add multi-GPU support, like in the CUDA refactor? https://github.com/ggerganov/llama.cpp/pull/1607/commits