Unable to use Intel UHD GPU acceleration with BLAS
Expected Behavior
The GPU should be used during inference.
Current Behavior
Here's how I built the software:
git clone https://github.com/ggerganov/llama.cpp .
extracted w64devkit-fortran somewhere and copied the required OpenBLAS files into its folders
ran w64devkit.exe
cd to my llama.cpp folder
make LLAMA_OPENBLAS=1
Then I followed the "Intel MKL" section below:
mkdir build
cd build
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build . --config Release
Finally I ran the app with:
.\build\bin\Release\main.exe -m ./models/7B/ggml-model-q4_0.bin -n 128 --interactive-first --color --threads 4 --mlock
But my iGPU sits at 2-3% while my CPU is at 70-80% during inference. Generation is a few words per second on 7B, which is not bad for a modest Intel laptop CPU.
Environment and Context
Physical hardware: Windows 11 laptop, i7-8565U 4c/8t, 16 GB RAM, Intel UHD 620
Operating System: Windows 11 22H2, Python 3.11.3, CMake 3.26.4
When loading main.exe with --mlock, GPU usage goes a bit higher, around 15%, but otherwise it never gets used. I tried with a really long prompt too.
Well, I don't even know if there is Intel GPU acceleration support at all; the README structure is a mess, not gonna lie.
Last I checked, Intel MKL is a CPU-only library. It will not use the iGPU.
Also, AFAIK the "BLAS" part is only used for prompt processing. The actual text generation uses custom code for CPUs and accelerators.
You could load the iGPU with CLBlast, but it might not actually speed things up because of the extra copies. There is not really a backend specifically targeting iGPUs yet.
Yeah, the documentation is a bit lacking.
The provided Windows build with CLBlast using OpenCL should work but I wouldn't expect any significant performance gains from integrated graphics.
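For anyone who wants to try that route, a rough sketch (assuming CLBlast itself is installed where CMake can find it, and that this llama.cpp version has the LLAMA_CLBLAST option and honours the GGML_OPENCL_PLATFORM / GGML_OPENCL_DEVICE environment variables for device selection):
mkdir build-clblast
cd build-clblast
cmake .. -DLLAMA_CLBLAST=ON
cmake --build . --config Release
set GGML_OPENCL_PLATFORM=Intel
set GGML_OPENCL_DEVICE=0
.\bin\Release\main.exe -m ..\models\7B\ggml-model-q4_0.bin -n 128 --threads 4
The ggml_opencl lines printed at startup should show which platform and device were actually picked.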
"copied the required OpenBLAS files into its folders ... then I followed the "Intel MKL" section"
Which one did you actually use? Did it actually find the Intel MKL library? Because OpenBLAS doesn't give you Intel MKL. They are completely different.
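One way to check (a rough sketch, not specific to any particular llama.cpp revision): wipe the CMake cache and re-run the configure step from the build directory, watching whether FindBLAS reports an actual MKL library path. Intel10_64lp is CMake's FindBLAS vendor name for MKL with the LP64 interface, so it only resolves if oneMKL is installed and its environment is set up; if it isn't found, the build typically just proceeds without BLAS.
del CMakeCache.txt
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx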
👋 TL;DR: CLBlast might not be faster. If there is a way to speed up prompt evaluation on this system, I'd be highly interested.
I got a similar setup, though I use Koboldcpp (which is based on llama.cpp).
In general: generation speed is fine (0.5-1 tok/s maybe), but prompt evaluation is horrible (2 tok/s). With my roleplay starting prompt of 1000 tokens, it easily takes 500s. Afterwards it generates quickly (~1-2 words/s) for 2-3 messages (because apparently it can cache the context). When it deletes older messages from the context so that it stays within the context limit, it suddenly needs to re-evaluate 500-1000 tokens, which again takes minutes.
In Koboldcpp you can just select OpenBLAS or CLBlast (GPU 1). I test with a 1087-token prompt and 87 generated tokens, using a 13B q4_0 model. CPU/GPU percentages are according to Task Manager (as far as I know those values can be hazy, especially for integrated GPUs).
OpenBLAS will not use the GPU; CPU is at 80%. Needs 300s.
CLBlast will use the GPU at 50-100% (it switches) and 80% CPU at first, then 40% CPU after some seconds. Needs 440s. Processing: 390.8s (360ms/T), Generation: 48.9s (753ms/T).
The next message will be quick (20s for 18 tokens evaluated and 26 generated). Time taken: Processing: 6.2s (342ms/T), Generation: 14.1s (541ms/T), Total: 20.2s.
But after some messages, the previous prompt will change to accommodate the small context (I use SillyTavern, btw), and then it will re-evaluate much of the prompt, needing 300s again.
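(As a sanity check on those numbers: 1087 prompt tokens × 360 ms/token ≈ 391 s, which matches the 390.8s Processing figure, so the ms/T value there refers to the prompt tokens.)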
Physical hardware: Windows 10 tablet/laptop (Dell Latitude), i5-8350U, 16 GB RAM, Intel UHD 620
Operating System: Windows 10 Pro
TBH you should test a Vulkan backend like mlc-llm. There isn't really a good way to leverage UHD 620 in llama.cpp yet, especially with max context prompts like that.
Thanks for the tip!
I tried it, but sadly it's slower than llama.cpp. :( It does use the GPU at 100%, though (according to Task Manager).
mlc-llm takes 220s to evaluate the prompt with their Vicuna 7B. llama.cpp takes 161s to evaluate the prompt with a 7B 4-bit model.
Also, from some llama.cpp tests: evaluating the prompt on a 13B model takes 250s (CPU only) to 380s (with a lot of GPU use). So, two learnings:
- Evaluating the same prompt on 13B takes longer than on 7B (I don't know why).
- The more the UHD GPU is used, the slower prompt evaluation gets. I guess that's because it takes time away from the CPU, and whatever llama.cpp does on the CPU is magic that is way faster than anything the GPU does...? I'd really like to know what's going on here. Maybe the GPU only accelerates 16-bit operations, so the CPU is faster because it can run the 4-bit stuff...? I really don't know.
Maybe the GPU only accelerates 16-bit operations, so the CPU is faster because it can run the 4-bit stuff...?
The OpenCL code in llama.cpp can run 4-bit generation on the GPU now, too, but it requires the model to be loaded into VRAM, which integrated GPUs don't have, or have very little of.
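For reference, with a CLBlast build the offload is controlled by the --n-gpu-layers (-ngl) option, assuming the build is recent enough to have it; the layer count below is just an arbitrary example:
.\build\bin\Release\main.exe -m ./models/7B/ggml-model-q4_0.bin -n 128 --threads 4 -ngl 20
On an iGPU the offloaded layers end up in the same physical RAM anyway, so don't expect it to free memory the way it does on a discrete card.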
According to the task manager there's 8 GB of Shared GPU memory/GPU memory. Does that count as VRAM in that context? Or does the Intel UHD 620 just have no VRAM?
It has no VRAM; it's just the RAM being used as VRAM. The BIOS does allocate some for it, but that's more for legacy purposes AFAIK; it will just use whatever it needs.
The good thing is that you don't need to copy data VRAM->RAM to access it on the CPU; it's always shared by both.
llama.cpp is not optimized for this yet, I don't think, so it will copy the data right now. But the UHD 620 is really slow anyway.
What I'm mostly wondering is: A) Is it physically impossible to increase the speed by using the GPU, or B) is this just a software issue, because the current libraries don't use the parallelism of the integrated GPU correctly?
And would the speed-up bring the evaluation time down from 250s to ~60s? Anything less would still be almost unusable, so I wouldn't even bother.
I guess I feel mostly confused because I thought the generation speed would be the limiting factor (as it seems to be on dedicated GPUs), not the prompt evaluation. :/
If you had a dedicated GPU, bringing down the prompt evaluation below 60s @ 1000 tokens is very much doable.
Theoretically, some iGPU-specific OpenCL code to "partially" offload the CPU could be written, since iGPUs don't have to operate out of a separate memory pool:
https://laude.cloud/post/jupyter/
Building according to the CLBlast section instructions succeeded on ubuntu-x64 with an intel-6402p. Here's the output of running the train-text-from-scratch example:
ggml_opencl: selecting platform: 'Intel(R) OpenCL HD Graphics'
ggml_opencl: selecting device: 'Intel(R) HD Graphics 510'
ggml_opencl: device FP16 support: true
main: init model
...
used_mem model+cache: 1083036416 bytes
main: begin training
GGML_ASSERT: .../llama.cpp/ggml-opencl.cpp:1343: false
Aborted
I believe that Intel oneMKL should actually run on an Intel GPU: https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2023-0/offloading-onemkl-computations-onto-the-gpu.html
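As a rough sketch of what that would involve (llama.cpp does not do any of this today, and blas_test.c here is just a hypothetical stand-alone program): the code has to be built with Intel's compilers and OpenMP target offload, and each BLAS call has to be wrapped in the dispatch pragmas described in that guide, so it would mean changes to ggml rather than just a build flag. On Linux the compile/run side looks roughly like:
icx -fiopenmp -fopenmp-targets=spir64 -qmkl blas_test.c -o blas_test
OMP_TARGET_OFFLOAD=MANDATORY ./blas_test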
I think we should bring this issue back; offloading at least the prompt eval to the iGPU would be very valuable.
This issue was closed because it has been inactive for 14 days since being marked as stale.