chatllm.cpp
How to use GPU?
This is dedicated to those who are GPU-poor, but stay tuned. 😄
@foldl I tried to build it with GPU support using cmake -B build-gpu -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DBUILD_SHARED_LIBS=ON (it works fine for llama.cpp), but compilation fails with errors like this:
nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified
Env: Windows 11, MSVC 2022, CUDA 12.5.
I think this is fixed in 3dd468dbf621ed8e8cdb10caf84bd1f5891695c2.
I tried it on current master (14c71642d384e98de9964c3e0a017a28b549b9e4), and the problem is still there - the same error message for .cu files, like this:
Error log
D:\repos-git\chatllm.cpp\build-gpu\ggml\src\ggml-cuda>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\nvcc.exe" --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\bin\HostX64\x64" -x cu -I"D:\repos-git\chatllm.cpp\ggml\include\ggml" -I"D:\repos-git\chatllm.cpp\ggml\src" -I"D:\repos-git\chatllm.cpp\ggml\src\ggml-cuda\.." -I"D:\repos-git\chatllm.cpp\ggml\src\..\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include" --keep-dir ggml-cuda\x64\Release -use_fast_math -maxrregcount=0 --machine 64 --compile -cudart static -std=c++17 -arch=native /wd4996 /wd4722 -Xcompiler="/EHsc -Ob2" -D_WINDOWS -DNDEBUG -D_CRT_SECURE_NO_WARNINGS -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_SCHED_MAX_COPIES=4 -D_XOPEN_SOURCE=600 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_F16 -DGGML_SHARED -D"CMAKE_INTDIR=\"Release\"" -Dggml_cuda_EXPORTS -D_WINDLL -D_MBCS -D"CMAKE_INTDIR=\"Release\"" -Dggml_cuda_EXPORTS -Xcompiler "/EHsc /W1 /nologo /O2 /FS /MD /GR" -Xcompiler "/Fdggml-cuda.dir\Release\vc143.pdb" -o ggml-cuda.dir\Release\fattn-vec-f32-instance-hs64-f16-f16.obj "D:\repos-git\chatllm.cpp\ggml\src\ggml-cuda\template-instances\fattn-vec-f32-instance-hs64-f16-f16.cu"
nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified
C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.6.targets(799,9): error MSB3721: The command "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\nvcc.exe" --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\bin\HostX64\x64" -x cu -I"D:\repos-git\chatllm.cpp\ggml\include\ggml" -I"D:\repos-git\chatllm.cpp\ggml\src" -I"D:\repos-git\chatllm.cpp\ggml\src\ggml-cuda\.." -I"D:\repos-git\chatllm.cpp\ggml\src\..\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include" --keep-dir ggml-cuda\x64\Release -use_fast_math -maxrregcount=0 --machine 64 --compile -cudart static -std=c++17 -arch=native /wd4996 /wd4722 -Xcompiler="/EHsc -Ob2" -D_WINDOWS -DNDEBUG -D_CRT_SECURE_NO_WARNINGS -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_SCHED_MAX_COPIES=4 -D_XOPEN_SOURCE=600 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_F16 -DGGML_SHARED -D"CMAKE_INTDIR=\"Release\"" -Dggml_cuda_EXPORTS -D_WINDLL -D_MBCS -D"CMAKE_INTDIR=\"Release\"" -Dggml_cuda_EXPORTS -Xcompiler "/EHsc /W1 /nologo /O2 /FS /MD /GR" -Xcompiler "/Fdggml-cuda.dir\Release\vc143.pdb" -o ggml-cuda.dir\Release\fattn-vec-f32-instance-hs64-f16-f16.obj "D:\repos-git\chatllm.cpp\ggml\src\ggml-cuda\template-instances\fattn-vec-f32-instance-hs64-f16-f16.cu" exited with code 1. [D:\repos-git\chatllm.cpp\build-gpu\ggml\src\ggml-cuda\ggml-cuda.vcxproj]
Build commands:
1. cmake -B build-gpu -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DBUILD_SHARED_LIBS=ON
2. cmake --build build-gpu --config Release -j 6
Env is almost the same, just with CUDA updated to 12.6.
@MoonRide303 Now this can be built against CUDA. But only a few models work, I guess.
I just tested b9916381801b952ad9e2ea17ae09edf7aa6f3220, and it seems to be partially working now:
- Successfully compiled using the same 2 commands as above.
- Converted the locally downloaded Qwen2.5-1.5B-Instruct model using the python convert.py -i ..\Qwen2.5-1.5B-Instruct\ -t q8_0 -o Qwen2.5-1.5B-Instruct-Q8_0.bin command.
- Started the interactive chat using the main.exe -m .\Qwen2.5-1.5B-Instruct-Q8_0.bin -i command (runs on a CPU):
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
0: total = 17170956288, free = 15778971648
________ __ __ __ __ ___ (通义千问)
/ ____/ /_ ____ _/ /_/ / / / / |/ /_________ ____
/ / / __ \/ __ `/ __/ / / / / /|_/ // ___/ __ \/ __ \
/ /___/ / / / /_/ / /_/ /___/ /___/ / / // /__/ /_/ / /_/ /
\____/_/ /_/\__,_/\__/_____/_____/_/ /_(_)___/ .___/ .___/
You are served by QWen2, /_/ /_/
with 1543714304 (1.5B) parameters.
You > hi there
A.I. > Hello! How can I help you today?
You > who are you?
A.I. > I am a large language model created by Alibaba Cloud. I am here to help you with any questions or tasks you may have.
You >
- GPU acceleration doesn't work yet - trying to start it with main.exe -ngl 99 -m .\Qwen2.5-1.5B-Instruct-Q8_0.bin -i results in the following error:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
0: total = 17170956288, free = 15778971648
D:\repos-git\chatllm.cpp\ggml\src\ggml-backend.cpp:1678: GGML_ASSERT((char *)addr + ggml_backend_buffer_get_alloc_size(buffer, tensor) <= (char *)ggml_backend_buffer_get_base(buffer) + ggml_backend_buffer_get_size(buffer)) failed
How about -ngl 10? I have tested Vulkan & CUDA. It is OK with this model.
Same error with -ngl 10 for me.
@MoonRide303 Sorry for the incorrect information.
I have tested QWen2.5 7B & Llama3.1 8B with CUDA. Note: models with lm_head tied to the embedding generally do not work.
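For context, "lm_head tied to the embedding" means weight tying: the output projection reuses the token-embedding matrix instead of carrying its own weights. A minimal PyTorch-style sketch of the idea (illustrative only, not chatllm.cpp code; the class and attribute names are made up):

```python
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        # Weight tying: the output head shares the embedding matrix, so the
        # runtime has to treat lm_head as an alias of the embedding tensor.
        self.lm_head.weight = self.embed_tokens.weight
```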
build-cuda\bin\Release\main.exe -m g:\qwen2.5-7b.bin -p "write a quick sort function in python" -t 0 --max_length 100 -ngl 100
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
0: total = 23621861376, free = 19653984256
Certainly! Below is a Python implementation of the Quick Sort algorithm:
```python
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[len(arr) // 2]
        left = [x for x in arr if x < pivot]
        middle = [x for x in arr if x ==
RUN OUT OF CONTEXT. I have to stop now.
timings: prompt eval time = 697.05 ms / 26 tokens ( 26.81 ms per token, 37.30 tokens per second)
timings: eval time = 2747.43 ms / 73 tokens ( 37.64 ms per token, 26.57 tokens per second)
timings: total time = 3444.48 ms / 99 tokens
build-cuda\bin\Release\main.exe -m g:\llama3.1-8b.bin -p "write a quick sort function in python" -t 0 --max_length 100 -ngl 100
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
0: total = 23621861376, free = 19653984256
**Quick Sort Function in Python**
=====================================
Here is a high-quality, readable, and well-documented implementation of the Quick Sort algorithm in Python:
```python
def quick_sort(arr):
    """
    Sorts an array using the Quick Sort algorithm.
    Args:
        arr (list): The array to be sorted.
    Returns:
        list: The sorted array.
    """
    if len
RUN OUT OF CONTEXT. I have to stop now.
timings: prompt eval time = 295.89 ms / 19 tokens ( 15.57 ms per token, 64.21 tokens per second)
timings: eval time = 3133.93 ms / 80 tokens ( 39.17 ms per token, 25.53 tokens per second)
timings: total time = 3429.82 ms / 99 tokens
@foldl I managed to launch Qwen2.5-7B-Instruct, too. When I asked it to write a story, I observed GPU usage close to 100% - so it seems GPU acceleration is working 👍. But it's pretty confusing that it works like this - some models using the same architecture can use the GPU, and some others can't.
I also wanted to try Qwen2.5-7B-Instruct-1M, but despite enforcing a small context it started allocating huge amounts of VRAM and ended up with an error:
main.exe -ngl 99 -c 2048 -m .\Qwen2.5-7B-Instruct-1M-Q8_0.bin -i
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
0: total = 17170956288, free = 15778971648
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1972.66 MiB on device 0: cudaMalloc failed: out of memory
D:\repos-git\chatllm.cpp\src\backend.cpp:91 check failed (buf) chatllm::LayerBufAllocator::alloc() failed to allocate buffer
Adding -c 2048 to the normal Qwen2.5-7B-Instruct didn't reduce VRAM usage, either - it looks like this parameter is being ignored.
-l (i.e. --max_length) should be used to reduce VRAM usage.
-c is used by the context-extending method.
The naming is a little bit confusing.
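For example (purely illustrative - exact values depend on the model and available VRAM), the earlier Qwen2.5-7B-Instruct-1M attempt with -l instead of -c would look like:
main.exe -ngl 99 -l 2048 -m .\Qwen2.5-7B-Instruct-1M-Q8_0.bin -i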
@MoonRide303 The bug related to Qwen2.5 1.5B is solved now. It's not caused by tied embedding, but by buffer allocation (not properly aligned).
Use -ngl 100,prolog,epilog to run the whole model on GPU.
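For readers wondering what "not properly aligned" means here: backend buffers typically round each allocation up to an alignment boundary before tensors are placed into them. A tiny illustrative sketch (not the actual chatllm.cpp code; the 32-byte alignment value is an assumption):

```python
def aligned_size(size: int, alignment: int = 32) -> int:
    """Round an allocation size up to the next multiple of `alignment`."""
    return (size + alignment - 1) // alignment * alignment

# A 100-byte request gets padded to 128 bytes; an already-aligned
# 128-byte request is left unchanged.
assert aligned_size(100) == 128
assert aligned_size(128) == 128
```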
Worked for me, too 👍. I also checked gemma-2-2b-it - it looks good as well:
> main.exe -ngl 99,prolog,epilog -l 2048 -m .\gemma-2-2b-it-Q8_0.bin -i
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
0: total = 17170956288, free = 15778971648
________ __ __ __ __ ___
/ ____/ /_ ____ _/ /_/ / / / / |/ /_________ ____
/ / / __ \/ __ `/ __/ / / / / /|_/ // ___/ __ \/ __ \
/ /___/ / / / /_/ / /_/ /___/ /___/ / / // /__/ /_/ / /_/ /
\____/_/ /_/\__,_/\__/_____/_____/_/ /_(_)___/ .___/ .___/
You are served by Gemma-2, /_/ /_/
with 2614341888 (2.6B) parameters.
You > hi there
A.I. > Hello! 👋 How can I help you today? 😊
You > who are you?
A.I. > I am Gemma, an AI assistant created by the Gemma team. I'm here to help you with any questions or tasks you might have!
What can I do for you today? 😄