Unable to use Intel UHD GPU acceleration with BLAS
Expected Behavior
The GPU should be used during inference.
Current Behavior
Here's how I built the software:
git clone https://github.com/ggerganov/llama.cpp .
extracted w64devkit-fortran somewhere and copied the required OpenBLAS files into its folders
ran w64devkit.exe
cd to my llama.cpp folder
make LLAMA_OPENBLAS=1
Then I followed the "Intel MKL" section below:
mkdir build
cd build
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build . --config Release
Finally I ran the app with:
.\build\bin\Release\main.exe -m ./models/7B/ggml-model-q4_0.bin -n 128 --interactive-first --color --threads 4 --mlock
But my iGPU sits at 2-3% while my CPU is at 70-80% during inference. Generation is a few words per second on 7B, which is not bad for a modest Intel laptop CPU.
Environment and Context
Physical hardware: Windows 11 laptop, i7-8565U 4c/8t, 16 GB RAM, Intel UHD 620
Operating System: Windows 11 22H2, Python 3.11.3, CMake 3.26.4
When loading main.exe with --mlock, GPU usage goes a bit higher, around 15%, but otherwise it never gets used. I tried with a really long prompt too.
Well, I don't even know if there is Intel GPU acceleration support at all; the README structure is a mess, not gonna lie.
Last I checked, Intel MKL is a CPU-only library. It will not use the iGPU.
Also, AFAIK the "BLAS" part is only used for prompt processing. The actual text generation uses custom code for CPUs and accelerators.
You could load the iGPU with CLBlast, but it might not actually speed things up because of the extra copies. There is not really a backend specifically targeting iGPUs yet.
Yeah, the documentation is a bit lacking.
The provided Windows build with CLBlast using OpenCL should work but I wouldn't expect any significant performance gains from integrated graphics.
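For anyone who wants to try that route, a rough sketch (assuming CLBlast itself is installed where CMake can find it, and that this llama.cpp version has the LLAMA_CLBLAST option and honours the GGML_OPENCL_PLATFORM / GGML_OPENCL_DEVICE environment variables for device selection):
mkdir build-clblast
cd build-clblast
cmake .. -DLLAMA_CLBLAST=ON
cmake --build . --config Release
set GGML_OPENCL_PLATFORM=Intel
set GGML_OPENCL_DEVICE=0
.\bin\Release\main.exe -m ..\models\7B\ggml-model-q4_0.bin -n 128 --threads 4
The ggml_opencl lines printed at startup should show which platform and device were actually picked.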
"copied the required OpenBLAS files into its folders ... then I followed the "Intel MKL" section"
Which one did you actually use? Did it actually find the Intel MKL library? Because OpenBLAS doesn't give you Intel MKL. They are completely different.
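One way to check (a rough sketch, not specific to any particular llama.cpp revision): wipe the CMake cache and re-run the configure step from the build directory, watching whether FindBLAS reports an actual MKL library path. Intel10_64lp is CMake's FindBLAS vendor name for MKL with the LP64 interface, so it only resolves if oneMKL is installed and its environment is set up; if it isn't found, the build typically just proceeds without BLAS.
del CMakeCache.txt
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx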
👋 TL;DR: CLBlast might not be faster. If there is a way to speed up prompt evaluation on this system, I'd be highly interested.
I got a similar setup, though I use Koboldcpp (which is based on llama.cpp).
In general: generation speed is fine (0.5-1 tok/s maybe), but prompt evaluation is horrible (2 tok/s). With my roleplay starting prompt of 1000 tokens, it easily takes 500s. Afterwards it generates quickly (~1-2 words/s) for 2-3 messages (because apparently it can cache the context). When it deletes older messages from the context so that it stays within the context limit, it suddenly needs to re-evaluate 500-1000 tokens, which again takes minutes.
In Koboldcpp you can just select OpenBLAS or CLBlast (GPU 1). I test with a 1087-token prompt and 87 generated tokens, using a 13B q4_0 model. CPU/GPU percentages are according to Task Manager (as far as I know those values can be hazy, especially for integrated GPUs).
OpenBLAS will not use the GPU; CPU is at 80%. Needs 300s.
CLBlast will use the GPU at 50-100% (it switches) and 80% CPU at first, then 40% CPU after some seconds. Needs 440s. Processing: 390.8s (360ms/T), Generation: 48.9s (753ms/T).
The next message will be quick (20s for 18 tokens evaluated and 26 generated). Time taken: Processing: 6.2s (342ms/T), Generation: 14.1s (541ms/T), Total: 20.2s.
But after some messages, the previous prompt will change to accommodate the small context (I use SillyTavern, btw), and then it will re-evaluate much of the prompt, needing 300s again.
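(As a sanity check on those numbers: 1087 prompt tokens × 360 ms/token ≈ 391 s, which matches the 390.8s Processing figure, so the ms/T value there refers to the prompt tokens.)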
Physical hardware: Windows 10 tablet/laptop (Dell Latitude), i5-8350U, 16 GB RAM, Intel UHD 620
Operating System: Windows 10 Pro
TBH you should test a Vulkan backend like mlc-llm. There isn't really a good way to leverage UHD 620 in llama.cpp yet, especially with max context prompts like that.
Thanks for the tip!
I tried it, but sadly it's slower than llama.cpp. :( It does use the GPU at 100%, though (according to Task Manager).
mlc-llm takes 220s to evaluate the prompt with their Vicuna 7B. llama.cpp takes 161s to evaluate the prompt with a 7B 4-bit model.
Also, from some llama.cpp tests: evaluating the prompt on a 13B model takes 250s (CPU only) to 380s (with a lot of GPU use). So, two learnings:
- Evaluating the same prompt on 13B takes longer than on 7B (I don't know why).
- The more the UHD GPU is used, the slower prompt evaluation gets. I guess that's because it takes time away from the CPU, and whatever llama.cpp does on the CPU is magic that is way faster than anything the GPU does...? I'd really like to know what's going on here. Maybe the GPU only accelerates 16-bit operations, so the CPU is faster because it can run the 4-bit stuff...? I really don't know.
Maybe the GPU only accelerates 16-bit operations, so the CPU is faster because it can run the 4-bit stuff...?
The OpenCL code in llama.cpp can run 4-bit generation on the GPU now, too, but it requires the model to be loaded into VRAM, which integrated GPUs don't have, or have very little of.
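For reference, with a CLBlast build the offload is controlled by the --n-gpu-layers (-ngl) option, assuming the build is recent enough to have it; the layer count below is just an arbitrary example:
.\build\bin\Release\main.exe -m ./models/7B/ggml-model-q4_0.bin -n 128 --threads 4 -ngl 20
On an iGPU the offloaded layers end up in the same physical RAM anyway, so don't expect it to free memory the way it does on a discrete card.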
According to the task manager there's 8 GB of Shared GPU memory/GPU memory. Does that count as VRAM in that context? Or does the Intel UHD 620 just have no VRAM?
It has no VRAM; it's just the RAM being used as VRAM. The BIOS does allocate some for it, but that's more for legacy purposes AFAIK; it will just use whatever it needs.
The good thing is that you don't need to copy data VRAM->RAM to access it on the CPU; it's always shared by both.
llama.cpp is not optimized for this yet, I don't think, so it will copy the data right now. But the UHD 620 is really slow anyway.
What I'm mostly wondering is: A) Is it physically impossible to increase the speed by using the GPU, or B) is this just a software issue, because the current libraries don't use the parallelism of the integrated GPU correctly?
And would the speed-up bring the evaluation time down from 250s to ~60s? Anything less would still be almost unusable, so I wouldn't even bother.
I guess I feel mostly confused because I thought the generation speed would be the limiting factor (as it seems to be on dedicated GPUs), not the prompt evaluation. :/
If you had a dedicated GPU, bringing down the prompt evaluation below 60s @ 1000 tokens is very much doable.
Theoretically, some iGPU-specific OpenCL code to "partially" offload the CPU could be written, since iGPUs don't have to operate out of a separate memory pool:
https://laude.cloud/post/jupyter/
Building according to the CLBlast section instructions succeeded on ubuntu-x64 with an intel-6402p. Here's the output of running the train-text-from-scratch example:
ggml_opencl: selecting platform: 'Intel(R) OpenCL HD Graphics'
ggml_opencl: selecting device: 'Intel(R) HD Graphics 510'
ggml_opencl: device FP16 support: true
main: init model
...
used_mem model+cache: 1083036416 bytes
main: begin training
GGML_ASSERT: .../llama.cpp/ggml-opencl.cpp:1343: false
Aborted
I believe that Intel oneMKL should actually run on an Intel GPU: https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2023-0/offloading-onemkl-computations-onto-the-gpu.html
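As a rough sketch of what that would involve (llama.cpp does not do any of this today, and blas_test.c here is just a hypothetical stand-alone program): the code has to be built with Intel's compilers and OpenMP target offload, and each BLAS call has to be wrapped in the dispatch pragmas described in that guide, so it would mean changes to ggml rather than just a build flag. On Linux the compile/run side looks roughly like:
icx -fiopenmp -fopenmp-targets=spir64 -qmkl blas_test.c -o blas_test
OMP_TARGET_OFFLOAD=MANDATORY ./blas_test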
I think we should bring this issue back; offloading at least the prompt eval to the iGPU would be very valuable.
This issue was closed because it has been inactive for 14 days since being marked as stale.