chatllm.cpp
How to use GPU?
This is dedicated to those who are GPU-poor, but stay tuned. 😄
@foldl I tried to build it with GPU support using cmake -B build-gpu -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DBUILD_SHARED_LIBS=ON (it works fine for llama.cpp), but compilation fails with errors like this:
nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified
Env: Windows 11, MSVC 2022, CUDA 12.5.
I think this is fixed in 3dd468dbf621ed8e8cdb10caf84bd1f5891695c2.
I tried it on current master (14c71642d384e98de9964c3e0a017a28b549b9e4), and the problem is still there - the same error message for .cu files, like this:
Error log
D:\repos-git\chatllm.cpp\build-gpu\ggml\src\ggml-cuda>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\nvcc.exe" --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\bin\HostX64\x64" -x cu -I"D:\repos-git\chatllm.cpp\ggml\include\ggml" -I"D:\repos-git\chatllm.cpp\ggml\src" -I"D:\repos-git\chatllm.cpp\ggml\src\ggml-cuda\.." -I"D:\repos-git\chatllm.cpp\ggml\src\..\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include" --keep-dir ggml-cuda\x64\Release -use_fast_math -maxrregcount=0 --machine 64 --compile -cudart static -std=c++17 -arch=native /wd4996 /wd4722 -Xcompiler="/EHsc -Ob2" -D_WINDOWS -DNDEBUG -D_CRT_SECURE_NO_WARNINGS -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_SCHED_MAX_COPIES=4 -D_XOPEN_SOURCE=600 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_F16 -DGGML_SHARED -D"CMAKE_INTDIR=\"Release\"" -Dggml_cuda_EXPORTS -D_WINDLL -D_MBCS -D"CMAKE_INTDIR=\"Release\"" -Dggml_cuda_EXPORTS -Xcompiler "/EHsc /W1 /nologo /O2 /FS /MD /GR" -Xcompiler "/Fdggml-cuda.dir\Release\vc143.pdb" -o ggml-cuda.dir\Release\fattn-vec-f32-instance-hs64-f16-f16.obj "D:\repos-git\chatllm.cpp\ggml\src\ggml-cuda\template-instances\fattn-vec-f32-instance-hs64-f16-f16.cu"
nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified
C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.6.targets(799,9): error MSB3721: The command "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\nvcc.exe" --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\bin\HostX64\x64" -x cu -I"D:\repos-git\chatllm.cpp\ggml\include\ggml" -I"D:\repos-git\chatllm.cpp\ggml\src" -I"D:\repos-git\chatllm.cpp\ggml\src\ggml-cuda\.." -I"D:\repos-git\chatllm.cpp\ggml\src\..\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include" --keep-dir ggml-cuda\x64\Release -use_fast_math -maxrregcount=0 --machine 64 --compile -cudart static -std=c++17 -arch=native /wd4996 /wd4722 -Xcompiler="/EHsc -Ob2" -D_WINDOWS -DNDEBUG -D_CRT_SECURE_NO_WARNINGS -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_SCHED_MAX_COPIES=4 -D_XOPEN_SOURCE=600 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_F16 -DGGML_SHARED -D"CMAKE_INTDIR=\"Release\"" -Dggml_cuda_EXPORTS -D_WINDLL -D_MBCS -D"CMAKE_INTDIR=\"Release\"" -Dggml_cuda_EXPORTS -Xcompiler "/EHsc /W1 /nologo /O2 /FS /MD /GR" -Xcompiler "/Fdggml-cuda.dir\Release\vc143.pdb" -o ggml-cuda.dir\Release\fattn-vec-f32-instance-hs64-f16-f16.obj "D:\repos-git\chatllm.cpp\ggml\src\ggml-cuda\template-instances\fattn-vec-f32-instance-hs64-f16-f16.cu" exited with code 1. [D:\repos-git\chatllm.cpp\build-gpu\ggml\src\ggml-cuda\ggml-cuda.vcxproj]
Build commands:
1. cmake -B build-gpu -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DBUILD_SHARED_LIBS=ON
2. cmake --build build-gpu --config Release -j 6
Env is almost the same, just with CUDA updated to 12.6.
@MoonRide303 Now this can be built against CUDA. But only a few models work, I guess.
I just tested b9916381801b952ad9e2ea17ae09edf7aa6f3220, and it seems to be partially working now:
- Successfully compiled using the same 2 commands as above.
- Converted the locally downloaded Qwen2.5-1.5B-Instruct model using the python convert.py -i ..\Qwen2.5-1.5B-Instruct\ -t q8_0 -o Qwen2.5-1.5B-Instruct-Q8_0.bin command.
- Started the interactive chat using the main.exe -m .\Qwen2.5-1.5B-Instruct-Q8_0.bin -i command (runs on a CPU):
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
0: total = 17170956288, free = 15778971648
________ __ __ __ __ ___ (通义千问)
/ ____/ /_ ____ _/ /_/ / / / / |/ /_________ ____
/ / / __ \/ __ `/ __/ / / / / /|_/ // ___/ __ \/ __ \
/ /___/ / / / /_/ / /_/ /___/ /___/ / / // /__/ /_/ / /_/ /
\____/_/ /_/\__,_/\__/_____/_____/_/ /_(_)___/ .___/ .___/
You are served by QWen2, /_/ /_/
with 1543714304 (1.5B) parameters.
You > hi there
A.I. > Hello! How can I help you today?
You > who are you?
A.I. > I am a large language model created by Alibaba Cloud. I am here to help you with any questions or tasks you may have.
You >
- GPU acceleration doesn't work yet - trying to start it with main.exe -ngl 99 -m .\Qwen2.5-1.5B-Instruct-Q8_0.bin -i results in the following error:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
0: total = 17170956288, free = 15778971648
D:\repos-git\chatllm.cpp\ggml\src\ggml-backend.cpp:1678: GGML_ASSERT((char *)addr + ggml_backend_buffer_get_alloc_size(buffer, tensor) <= (char *)ggml_backend_buffer_get_base(buffer) + ggml_backend_buffer_get_size(buffer)) failed
How about -ngl 10? I have tested Vulkan & CUDA. It is OK with this model.
Same error with -ngl 10 for me.
@MoonRide303 Sorry for the incorrect information.
I have tested QWen2.5 7B & Llama3.1 8B with CUDA. Note: models with lm_head tied to the embedding generally do not work.
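For context, "lm_head tied to the embedding" means weight tying: the output projection reuses the token-embedding matrix instead of carrying its own weights. A minimal PyTorch-style sketch of the idea (illustrative only, not chatllm.cpp code; the class and attribute names are made up):

```python
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        # Weight tying: the output head shares the embedding matrix, so the
        # runtime has to treat lm_head as an alias of the embedding tensor.
        self.lm_head.weight = self.embed_tokens.weight
```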
build-cuda\bin\Release\main.exe -m g:\qwen2.5-7b.bin -p "write a quick sort function in python" -t 0 --max_length 100 -ngl 100
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
0: total = 23621861376, free = 19653984256
Certainly! Below is a Python implementation of the Quick Sort algorithm:
```python
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[len(arr) // 2]
        left = [x for x in arr if x < pivot]
        middle = [x for x in arr if x ==
RUN OUT OF CONTEXT. I have to stop now.
timings: prompt eval time = 697.05 ms / 26 tokens ( 26.81 ms per token, 37.30 tokens per second)
timings: eval time = 2747.43 ms / 73 tokens ( 37.64 ms per token, 26.57 tokens per second)
timings: total time = 3444.48 ms / 99 tokens
build-cuda\bin\Release\main.exe -m g:\llama3.1-8b.bin -p "write a quick sort function in python" -t 0 --max_length 100 -ngl 100
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
0: total = 23621861376, free = 19653984256
**Quick Sort Function in Python**
=====================================
Here is a high-quality, readable, and well-documented implementation of the Quick Sort algorithm in Python:
```python
def quick_sort(arr):
    """
    Sorts an array using the Quick Sort algorithm.
    Args:
        arr (list): The array to be sorted.
    Returns:
        list: The sorted array.
    """
    if len
RUN OUT OF CONTEXT. I have to stop now.
timings: prompt eval time = 295.89 ms / 19 tokens ( 15.57 ms per token, 64.21 tokens per second)
timings: eval time = 3133.93 ms / 80 tokens ( 39.17 ms per token, 25.53 tokens per second)
timings: total time = 3429.82 ms / 99 tokens
@foldl I managed to launch Qwen2.5-7B-Instruct, too. When I asked it to write a story, I observed GPU usage close to 100% - so it seems GPU acceleration is working 👍. But it's pretty confusing that it works like this - some models using the same architecture can use the GPU, and some others can't.
I also wanted to try Qwen2.5-7B-Instruct-1M, but despite enforcing a small context it started allocating huge amounts of VRAM and ended up with an error:
main.exe -ngl 99 -c 2048 -m .\Qwen2.5-7B-Instruct-1M-Q8_0.bin -i
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
0: total = 17170956288, free = 15778971648
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1972.66 MiB on device 0: cudaMalloc failed: out of memory
D:\repos-git\chatllm.cpp\src\backend.cpp:91 check failed (buf) chatllm::LayerBufAllocator::alloc() failed to allocate buffer
Adding -c 2048 to the normal Qwen2.5-7B-Instruct didn't reduce VRAM usage, either - it looks like this parameter is being ignored.
-l (i.e. --max_length) should be used to reduce VRAM usage.
-c is used by the context-extending method.
The naming is a little bit confusing.
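For example (purely illustrative - exact values depend on the model and available VRAM), the earlier Qwen2.5-7B-Instruct-1M attempt with -l instead of -c would look like:
main.exe -ngl 99 -l 2048 -m .\Qwen2.5-7B-Instruct-1M-Q8_0.bin -i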
@MoonRide303 The bug related to Qwen2.5 1.5B is solved now. It's not caused by tied embedding, but by buffer allocation (not properly aligned).
Use -ngl 100,prolog,epilog to run the whole model on GPU.
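For readers wondering what "not properly aligned" means here: backend buffers typically round each allocation up to an alignment boundary before tensors are placed into them. A tiny illustrative sketch (not the actual chatllm.cpp code; the 32-byte alignment value is an assumption):

```python
def aligned_size(size: int, alignment: int = 32) -> int:
    """Round an allocation size up to the next multiple of `alignment`."""
    return (size + alignment - 1) // alignment * alignment

# A 100-byte request gets padded to 128 bytes; an already-aligned
# 128-byte request is left unchanged.
assert aligned_size(100) == 128
assert aligned_size(128) == 128
```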
Worked for me, too 👍. I also checked gemma-2-2b-it - it looks good as well:
> main.exe -ngl 99,prolog,epilog -l 2048 -m .\gemma-2-2b-it-Q8_0.bin -i
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
0: total = 17170956288, free = 15778971648
________ __ __ __ __ ___
/ ____/ /_ ____ _/ /_/ / / / / |/ /_________ ____
/ / / __ \/ __ `/ __/ / / / / /|_/ // ___/ __ \/ __ \
/ /___/ / / / /_/ / /_/ /___/ /___/ / / // /__/ /_/ / /_/ /
\____/_/ /_/\__,_/\__/_____/_____/_/ /_(_)___/ .___/ .___/
You are served by Gemma-2, /_/ /_/
with 2614341888 (2.6B) parameters.
You > hi there
A.I. > Hello! 👋 How can I help you today? 😊
You > who are you?
A.I. > I am Gemma, an AI assistant created by the Gemma team. I'm here to help you with any questions or tasks you might have!
What can I do for you today? 😄