Please also compile a CLBlast version!
I have a laptop with an NVIDIA GTX 970M... The llama-b3078-bin-win-clblast-x64 build did some offloading and sped things up, using 3 GB of the onboard memory. The other versions I tested don't do that.
Yes, please! I have a similar notebook (GTX 980M)... the only versions that work are the AVX2 and the CLBlast ones! So now I am stuck with the last version compiled with CLBlast...
GTX900 should have both CUDA and Vulkan support, both of which should be faster and better supported than OpenCL. In any case, unless someone volunteers to maintain the OpenCL backend it will not be added back.
Vulkan crashes at this time while CLBlast worked; see #7769.
With MSYS2 I just did:

```
git checkout 0cd6bd3483fa66124b76a8a8ac794d9ee18c70c1
pacman -Su mingw-w64-clang-x86_64-clblast
pacman -Su mingw-w64-clang-x86_64-opencl-clhpp
cmake -B build -DLLAMA_NATIVE=ON -DLLAMA_BUILD_SERVER=ON -DLLAMA_CLBLAST=ON -DBUILD_SHARED_LIBS=OFF
cmake --build build -j $(nproc)
```
And the compilation worked.
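Offloading is then enabled with `-ngl` (`--n-gpu-layers`) as usual; an illustrative invocation (binary name and model path depend on your build, and newer builds use `llama-cli` instead of `main`):

```
./build/bin/main -m ./models/model-q4_0.gguf -ngl 8 -p "Hello"
```

which gives output like: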
```
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 8 repeating layers to GPU
llm_load_tensors: offloaded 8/33 layers to GPU
llm_load_tensors: CPU buffer size = 5671.02 MiB
llm_load_tensors: OpenCL buffer size = 1365.25 MiB
```
> GTX900 should have both CUDA and Vulkan support, both of which should be faster and better supported than OpenCL. In any case, unless someone volunteers to maintain the OpenCL backend it will not be added back.
I would, but I don't have the skill to do that... What I know is that with MSYS2 and CLANG64, llama.cpp compiled perfectly (until OpenCL support was removed). And I can't make any other version use my GTX 980M's small memory to offload at least a few tensors.
I'm interested in working on the OpenCL back-end. But I would rewrite it rather than maintain the CLBlast back-end that was recently removed.
If you want an explanation of the code and want to talk, feel free to contact me via email.
I worked on the CLBlast back-end in the past, and I'm familiar with the code. My reason to go for a rewrite would be different targets for optimization, mostly devices where the CPU and GPU use the same physical RAM. Unfortunately, I currently don't have any suitable device, so I'm not working on it.
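For illustration, a minimal sketch of the zero-copy pattern such a rewrite could target (a hypothetical helper, not llama.cpp code; `ctx` and `queue` are assumed to be an initialized OpenCL context and command queue on a device that shares physical RAM with the host):

```c
#include <CL/cl.h>

/* Sketch only: allocate a buffer the GPU can use directly from
 * host-visible memory, avoiding a separate upload copy on UMA hardware. */
static void *alloc_zero_copy(cl_context ctx, cl_command_queue queue,
                             size_t size, cl_mem *out_buf) {
    cl_int err;
    /* CL_MEM_ALLOC_HOST_PTR asks the driver for host-accessible memory;
     * on unified-memory devices the GPU reads the same physical pages. */
    *out_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                              size, NULL, &err);
    if (err != CL_SUCCESS) return NULL;
    /* A blocking map hands the CPU a pointer it can fill directly;
     * unmap before launching kernels that use the buffer. */
    return clEnqueueMapBuffer(queue, *out_buf, CL_TRUE, CL_MAP_WRITE,
                              0, size, 0, NULL, NULL, &err);
}
```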
> I worked on the CLBlast back-end in the past, and I'm familiar with the code. My reason to go for a rewrite would be different targets for optimization, mostly devices where the CPU and GPU use the same physical RAM. Unfortunately, I currently don't have any suitable device, so I'm not working on it.
That is also something that could be done on the Vulkan backend. I've done some rudimentary support for Unified Memory Architecture, but haven't had the time to look further into it.
Even an old laptop/old GPU could have some value with this enabled. CUDA libraries are HUGE, and in the end I did not notice any improvement in speed... With CLBlast, at least the GPU is used.
I thought they were removing CLBlast support.
> I thought they were removing CLBlast support.
They are unless they find a maintainer.
It was already removed in 554c247caffed64465f372661f2826640cb10430, so resolving this issue positively would require adding that back-end back.
Keep in mind that the Vulkan backend still has memory allocation issues that affect RAM usage. This can prevent you from using 7B models in higher quants, and larger models, on 16 GB RAM, for example. Simply removing CLBlast before this issue is fixed was a bad idea.
Considering the small footprint of the CLBlast implementation and the relative unpopularity of MoE, it's better to revert the change.
Just want to add that if CLBlast returns, I'd be happy to volunteer my time as much as I can.
What you can do right now is to make a fork with CLBlast back-end restored and maintain it.
One way to do it is to revert the commit 554c247caffed64465f372661f2826640cb10430 and, in the future, rebase your branch on upstream. For example, I did that here: https://github.com/shibe2/llama.cpp/tree/clblast. I restored only files that contain functional code, to avoid dealing with additional conflicts when the documentation changes. By the way, anyone who wants to keep using the CLBlast back-end can use the patch from my example fork while it applies cleanly.
Another way is to fork at the last commit 0cd6bd3483fa66124b76a8a8ac794d9ee18c70c1 before the removal of CLBlast and cherry-pick commits from upstream, excluding the one that removes CLBlast. Both approaches are sketched below.
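For concreteness, the two approaches look roughly like this (assuming a remote named `upstream` pointing at ggerganov/llama.cpp):

```
# Option 1: revert the removal commit and keep rebasing on upstream
git revert 554c247caffed64465f372661f2826640cb10430
git fetch upstream
git rebase upstream/master

# Option 2: branch off the last commit before the removal,
# then cherry-pick upstream commits, skipping the removal commit
git checkout -b clblast 0cd6bd3483fa66124b76a8a8ac794d9ee18c70c1
git cherry-pick <commits from upstream, excluding 554c247>
```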
@shibe2 Thank you! I was wondering if the CLBlast removal was somehow necessary for the RoPE refactoring, but it looks like it wasn't. Now I really don't see a reason for https://github.com/ggerganov/llama.cpp/commit/554c247caffed64465f372661f2826640cb10430.
I also want to point out that all this time, while being unmaintained, CLBlast just worked - apart from MoE, of course. However, most new models are not MoE.
> I also want to point out that all this time, while being unmaintained, CLBlast just worked - apart from MoE, of course. However, most new models are not MoE.
There are issues even with non-MoE models.
I haven't used MoE models, so I haven't looked into it. What other models had problems with CLBlast back-end?
There is this issue #7661 regarding LLaMA 2. Even disregarding concrete issues with incorrect results, a problem (for developers) is that, unlike CUDA/Vulkan/Metal, OpenCL did not fit the "backend" architecture that was developed at some point. At the time of removal it was in essence just crudely grafted onto the CPU backend, which caused additional work (it was like that because that is how the GPU backends started, and no one has been maintaining OpenCL).
This looks more like an ARM-specific corner case. At the same time, CLBlast works correctly with Llama 2 and subsequent models on x86-64. I still don't see a reason to remove something that works in most cases and requires no maintenance to keep doing so.
Additionally, if the main reason is that there's nobody maintaining CLBlast, please consider that Vulkan is maintained by essentially one person only.
> I still don't see a reason to remove something that works in most cases and requires no maintenance to keep doing so.
I don't either, but I do see a reason to remove something that does require maintenance effort with no dev willing to put it in.
> Additionally, if the main reason is that there's nobody maintaining CLBlast, please consider that Vulkan is maintained by essentially one person only.
That's the state for essentially the entire project. If any of the core devs were to be run over by a bus there would be no one to replace them.
> There is this issue #7661 regarding LLaMA 2.
Regarding it producing incorrect results, it may be a local issue, not necessarily a problem in llama.cpp code. We don't know, because the back-end was nuked instead of the problem being investigated.
Regarding CLBlast being slower than the CPU, this is expected in some cases where RAM bandwidth is the performance bottleneck. CLBlast needs unquantized data, so model parameters are dequantized and stored in RAM before multiplication. Unless everything fully fits in cache, this adds memory writes and reads compared to doing the same computation on the CPU. That's why I would write an OpenCL back-end in a different way rather than keep relying on CLBlast.
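As a rough back-of-the-envelope illustration (assuming Q4_0 at ~4.5 bits per weight, dequantized to fp32 in RAM; the exact numbers are only indicative):

```
per weight, CPU path:      read ~0.56 B (quantized)         ≈ 0.56 B of traffic
per weight, CLBlast path:  read ~0.56 B + write 4 B (fp32)
                           + read 4 B back for the GEMM     ≈ 8.56 B of traffic
```

That's roughly 15x more memory traffic per weight whenever the dequantized tensor spills out of cache.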
> That is also something that could be done on the Vulkan backend. I've done some rudimentary support for Unified Memory Architecture, but haven't had the time to look further into it.
Unfortunately, AMD pulled the old bait-and-switch and nerfed this functionality for a large portion of their GPU lineup. Last I checked, Steam was reporting that the vast majority of systems with AMD GPUs fell into this camp.
I found an up to 4000% decrease in performance just by switching from sycl::malloc_device() to sycl::malloc_shared(), even when all I do is repeatedly resubmit the same SYCL kernel. Most AMD discrete GPUs either disable the XNACK feature by default or don't support it at all, even though the silicon appears to have had this capability since GFX8: https://stackoverflow.com/questions/76700305/4000-performance-decrease-in-sycl-when-using-unified-shared-memory-instead-of-d
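For reference, the difference comes down to which allocator the buffers come from; a minimal SYCL 2020 sketch (illustrative only, not the benchmark from the linked question):

```cpp
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;
    constexpr size_t n = 1 << 20;
    // Device-local allocation: lives in VRAM, needs explicit copies to/from host.
    float *dev = sycl::malloc_device<float>(n, q);
    // Shared allocation: pages migrate between host and device on demand;
    // on AMD this relies on XNACK, which many consumer GPUs disable or lack,
    // so accesses can fall back to a much slower path.
    float *shr = sycl::malloc_shared<float>(n, q);
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { dev[i] = 1.0f; }).wait();
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { shr[i] = 1.0f; }).wait();
    sycl::free(dev, q);
    sycl::free(shr, q);
}
```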
While CLBlast wasn't the most performant, it was very robust when it came to squeezing every last byte into VRAM. It handled overallocation seamlessly. Vulkan is much more unforgiving and will crash my X server, requiring a forceful reload of the amdgpu driver or a reboot. HIP is even worse, and overallocating will often require a hard reboot - as in fully powering down the system and letting the caps discharge.
Interesting. I was not using the hipBLAS back-end because at that time it was wasting VRAM. But now it seems less memory-hungry.
As for power cycling being required, it may be because of a hardware or firmware problem. I used to have something like that with an AMD GPU.
I think the main feature of llama.cpp is its efficiency. CLBlast just added to it. I tried other solutions like ollama, but none worked well. llama.cpp is still the best there is. Please re-add CLBlast.
> I think the main feature of llama.cpp is its efficiency. CLBlast just added to it. I tried other solutions like ollama, but none worked well. llama.cpp is still the best there is. Please re-add CLBlast.
I completely agree, @Zibri
> I think the main feature of llama.cpp is its efficiency. CLBlast just added to it. I tried other solutions like ollama, but none worked well. llama.cpp is still the best there is. Please re-add CLBlast.
This is an open source project. No one here is opposed to having the backend, but someone has to put in the work, keep it up to date, and fix bugs. No one has worked on the OpenCL backend for a long time now; that's why it was removed. If you want to take over maintaining it, you're welcome to.
> No one has worked on the OpenCL backend for a long time now; that's why it was removed.
If it were completely broken by this point - sure, but it's not. In fact, with CLBlast removed, there's no other "universal" backend to test Vulkan against. AFAIK all other backends are platform-specific.
You can add it back for yourself while it is not broken. When it breaks, someone will need to update it.