Please also compile a CLBlast version!
I have a laptop with an NVIDIA GTX 970M... The llama-b3078-bin-win-clblast-x64 build did some offloading and sped things up, using 3 GB of the onboard memory. The other versions I tested don't do that.
Yes, please! I have a similar notebook (GTX 980M)... the only versions that work are the AVX2 and the CLBlast ones! So now I am stuck with the last version compiled with CLBlast...
GTX900 should have both CUDA and Vulkan support, both of which should be faster and better supported than OpenCL. In any case, unless someone volunteers to maintain the OpenCL backend it will not be added back.
Vulkan crashes at this time while CLBlast worked; see #7769.
With MSYS2 I just did:

```
git checkout 0cd6bd3483fa66124b76a8a8ac794d9ee18c70c1
pacman -Su mingw-w64-clang-x86_64-clblast
pacman -Su mingw-w64-clang-x86_64-opencl-clhpp
cmake -B build -DLLAMA_NATIVE=ON -DLLAMA_BUILD_SERVER=ON -DLLAMA_CLBLAST=ON -DBUILD_SHARED_LIBS=OFF
cmake --build build -j $(nproc)
```
And the compilation worked.
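Offloading is then enabled with `-ngl` (`--n-gpu-layers`) as usual; an illustrative invocation (binary name and model path depend on your build, and newer builds use `llama-cli` instead of `main`):

```
./build/bin/main -m ./models/model-q4_0.gguf -ngl 8 -p "Hello"
```

which gives output like: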
```
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 8 repeating layers to GPU
llm_load_tensors: offloaded 8/33 layers to GPU
llm_load_tensors: CPU buffer size = 5671.02 MiB
llm_load_tensors: OpenCL buffer size = 1365.25 MiB
```
> GTX900 should have both CUDA and Vulkan support, both of which should be faster and better supported than OpenCL. In any case, unless someone volunteers to maintain the OpenCL backend it will not be added back.
I would, but I don't have the skill to do that... What I know is that with MSYS2 and CLANG64, llama.cpp compiled perfectly (until OpenCL support was removed). And I can't make any other version use my GTX 980M's small memory to offload at least a few tensors.
I'm interested in working on the OpenCL back-end. But I would rewrite it rather than maintain the CLBlast back-end that was recently removed.
If you want an explanation of the code and want to talk, feel free to contact me via email.
I worked on the CLBlast back-end in the past, and I'm familiar with the code. My reason to go for a rewrite would be different targets for optimization, mostly devices where the CPU and GPU use the same physical RAM. Unfortunately, I currently don't have any suitable device, so I'm not working on it.
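For illustration, a minimal sketch of the zero-copy pattern such a rewrite could target (a hypothetical helper, not llama.cpp code; `ctx` and `queue` are assumed to be an initialized OpenCL context and command queue on a device that shares physical RAM with the host):

```c
#include <CL/cl.h>

/* Sketch only: allocate a buffer the GPU can use directly from
 * host-visible memory, avoiding a separate upload copy on UMA hardware. */
static void *alloc_zero_copy(cl_context ctx, cl_command_queue queue,
                             size_t size, cl_mem *out_buf) {
    cl_int err;
    /* CL_MEM_ALLOC_HOST_PTR asks the driver for host-accessible memory;
     * on unified-memory devices the GPU reads the same physical pages. */
    *out_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                              size, NULL, &err);
    if (err != CL_SUCCESS) return NULL;
    /* A blocking map hands the CPU a pointer it can fill directly;
     * unmap before launching kernels that use the buffer. */
    return clEnqueueMapBuffer(queue, *out_buf, CL_TRUE, CL_MAP_WRITE,
                              0, size, 0, NULL, NULL, &err);
}
```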
> I worked on the CLBlast back-end in the past, and I'm familiar with the code. My reason to go for a rewrite would be different targets for optimization, mostly devices where the CPU and GPU use the same physical RAM. Unfortunately, I currently don't have any suitable device, so I'm not working on it.
That is also something that could be done on the Vulkan backend. I've done some rudimentary support for Unified Memory Architecture, but haven't had the time to look further into it.
Even an old laptop/old GPU could have some value with this enabled. CUDA libraries are HUGE, and in the end I did not notice any improvement in speed... With CLBlast, at least the GPU is used.
I thought they were removing CLBlast support.
> I thought they were removing CLBlast support.
They are unless they find a maintainer.
It was already removed in 554c247caffed64465f372661f2826640cb10430, so resolving this issue positively would require adding that back-end back.
Keep in mind that the Vulkan backend still has memory allocation issues that affect RAM usage. This can prevent you from using 7B models in higher quants, and larger models, on 16 GB RAM, for example. Simply removing CLBlast before this issue is fixed was a bad idea.
Considering the small footprint of the CLBlast implementation and the relative unpopularity of MoE, it's better to revert the change.
Just want to add that if CLBlast returns, I'd be happy to volunteer my time as much as I can.
What you can do right now is to make a fork with CLBlast back-end restored and maintain it.
One way to do it is to revert the commit 554c247caffed64465f372661f2826640cb10430 and, in the future, rebase your branch on upstream. For example, I did that here: https://github.com/shibe2/llama.cpp/tree/clblast. I restored only files that contain functional code, to avoid dealing with additional conflicts when the documentation changes. By the way, anyone who wants to keep using the CLBlast back-end can use the patch from my example fork while it applies cleanly.
Another way is to fork at the last commit 0cd6bd3483fa66124b76a8a8ac794d9ee18c70c1 before the removal of CLBlast and cherry-pick commits from upstream, excluding the one that removes CLBlast. Both approaches are sketched below.
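For concreteness, the two approaches look roughly like this (assuming a remote named `upstream` pointing at ggerganov/llama.cpp):

```
# Option 1: revert the removal commit and keep rebasing on upstream
git revert 554c247caffed64465f372661f2826640cb10430
git fetch upstream
git rebase upstream/master

# Option 2: branch off the last commit before the removal,
# then cherry-pick upstream commits, skipping the removal commit
git checkout -b clblast 0cd6bd3483fa66124b76a8a8ac794d9ee18c70c1
git cherry-pick <commits from upstream, excluding 554c247>
```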
@shibe2 Thank you! I was wondering if the CLBlast removal was somehow necessary for the RoPE refactoring, but it looks like it wasn't. Now I really don't see a reason for https://github.com/ggerganov/llama.cpp/commit/554c247caffed64465f372661f2826640cb10430.
I also want to point out that all this time, while being unmaintained, CLBlast just worked - apart from MoE, of course. However, most new models are not MoE.
> I also want to point out that all this time, while being unmaintained, CLBlast just worked - apart from MoE, of course. However, most new models are not MoE.
There are issues even with non-MoE models.
I haven't used MoE models, so I haven't looked into it. What other models had problems with CLBlast back-end?
There is this issue #7661 regarding LLaMA 2. Even disregarding concrete issues with incorrect results, a problem (for developers) is that, unlike CUDA/Vulkan/Metal, OpenCL did not fit the "backend" architecture that was developed at some point. At the time of removal it was in essence just crudely grafted onto the CPU backend, which caused additional work (it was like that because that is how the GPU backends started, and no one has been maintaining OpenCL).
This looks more like an ARM-specific corner case. At the same time, CLBlast works correctly with Llama 2 and subsequent models on x86-64. I still don't see a reason to remove something that works in most cases and requires no maintenance to keep doing so.
Additionally, if the main reason is that there's nobody maintaining CLBlast, please consider that Vulkan is maintained by essentially one person only.
> I still don't see a reason to remove something that works in most cases and requires no maintenance to keep doing so.
I don't either, but I do see a reason to remove something that does require maintenance effort with no dev willing to put it in.
> Additionally, if the main reason is that there's nobody maintaining CLBlast, please consider that Vulkan is maintained by essentially one person only.
That's the state for essentially the entire project. If any of the core devs were to be run over by a bus there would be no one to replace them.
> There is this issue #7661 regarding LLaMA 2.
Regarding it producing incorrect results, it may be a local issue, not necessarily a problem in llama.cpp code. We don't know, because the back-end was nuked instead of the problem being investigated.
Regarding CLBlast being slower than the CPU, this is expected in some cases where RAM bandwidth is the performance bottleneck. CLBlast needs unquantized data, so model parameters are dequantized and stored in RAM before multiplication. Unless everything fully fits in cache, this adds memory writes and reads compared to doing the same computation on the CPU. That's why I would write an OpenCL back-end in a different way rather than keep relying on CLBlast.
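As a rough back-of-the-envelope illustration (assuming Q4_0 at ~4.5 bits per weight, dequantized to fp32 in RAM; the exact numbers are only indicative):

```
per weight, CPU path:      read ~0.56 B (quantized)         ≈ 0.56 B of traffic
per weight, CLBlast path:  read ~0.56 B + write 4 B (fp32)
                           + read 4 B back for the GEMM     ≈ 8.56 B of traffic
```

That's roughly 15x more memory traffic per weight whenever the dequantized tensor spills out of cache.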
> That is also something that could be done on the Vulkan backend. I've done some rudimentary support for Unified Memory Architecture, but haven't had the time to look further into it.
Unfortunately, AMD pulled the old bait-and-switch and nerfed this functionality for a large portion of their GPU lineup. Last I checked, Steam was reporting that the vast majority of systems with AMD GPUs fell into this camp.
I found an up to 4000% decrease in performance just by switching from sycl::malloc_device() to sycl::malloc_shared(), even when all I do is repeatedly resubmit the same SYCL kernel. Most AMD discrete GPUs either disable the XNACK feature by default or don't support it at all, even though the silicon appears to have had this capability since GFX8: https://stackoverflow.com/questions/76700305/4000-performance-decrease-in-sycl-when-using-unified-shared-memory-instead-of-d
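For reference, the difference comes down to which allocator the buffers come from; a minimal SYCL 2020 sketch (illustrative only, not the benchmark from the linked question):

```cpp
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;
    constexpr size_t n = 1 << 20;
    // Device-local allocation: lives in VRAM, needs explicit copies to/from host.
    float *dev = sycl::malloc_device<float>(n, q);
    // Shared allocation: pages migrate between host and device on demand;
    // on AMD this relies on XNACK, which many consumer GPUs disable or lack,
    // so accesses can fall back to a much slower path.
    float *shr = sycl::malloc_shared<float>(n, q);
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { dev[i] = 1.0f; }).wait();
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { shr[i] = 1.0f; }).wait();
    sycl::free(dev, q);
    sycl::free(shr, q);
}
```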
While CLBlast wasn't the most performant, it was very robust when it came to squeezing every last byte into VRAM. It handled overallocation seamlessly. Vulkan is much more unforgiving and will crash my X server, requiring a forceful reload of the amdgpu driver or a reboot. HIP is even worse, and overallocating will often require a hard reboot - as in fully powering down the system and letting the caps discharge.
Interesting. I was not using the hipBLAS back-end because at that time it was wasting VRAM. But now it seems less memory-hungry.
As for power cycling being required, it may be because of a hardware or firmware problem. I used to have something like that with an AMD GPU.
I think the main feature of llama.cpp is its efficiency. CLBlast just added to it. I tried other solutions like ollama, but none worked well. llama.cpp is still the best there is. Please re-add CLBlast.
> I think the main feature of llama.cpp is its efficiency. CLBlast just added to it. I tried other solutions like ollama, but none worked well. llama.cpp is still the best there is. Please re-add CLBlast.
I completely agree, @Zibri
> I think the main feature of llama.cpp is its efficiency. CLBlast just added to it. I tried other solutions like ollama, but none worked well. llama.cpp is still the best there is. Please re-add CLBlast.
This is an open source project. No one here is opposed to having the backend, but someone has to put in the work, keep it up to date, and fix bugs. No one has worked on the OpenCL backend for a long time now; that's why it was removed. If you want to take over maintaining it, you're welcome to.
> No one has worked on the OpenCL backend for a long time now; that's why it was removed.
If it were completely broken by this point - sure, but it's not. In fact, with CLBlast removed, there's no other "universal" backend to test Vulkan against. AFAIK all other backends are platform-specific.
You can add it back for yourself while it is not broken. When it breaks, someone will need to update it.