llama.cpp
CUDA refactor, multi GPU support
This PR is quite large. Its primary goal is to lay the groundwork for the implementation of further CUDA kernels for ggml operations. I am also adding multi GPU support because it's easier to integrate now than it would be at a later point.
For Users
Build instructions (Linux):
git clone https://github.com/JohannesGaessler/llama.cpp llama.cpp-johannesgaessler
cd llama.cpp-johannesgaessler
git fetch
git switch cuda-refactor-8
make LLAMA_CUBLAS=1
When compiling with LLAMA_CUBLAS=1 the program automatically detects the available NVIDIA devices and splits the weights proportionally to their VRAM. There is not yet a CLI argument for setting the tensor split. The performance increase on my test systems is relatively low (+70% t/s when going from 1x GTX TITAN X to 4x GTX TITAN X). It's possible that there is still a bug that hampers performance. Please do tell me how well (if at all) it works for you. In any case, this PR should already allow you to pool the VRAM of multiple GPUs to load larger models.
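For example, after building, a run with some layers offloaded could look like this (model path and layer count are just placeholders; --n-gpu-layers works the same way as on master):
./main -m models/7B/ggml-model-q4_0.bin -p "Hello" -n 128 --n-gpu-layers 32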
For Developers
~~This PR is still very much WIP. I will do a refactor to remove artifacts from bad/obsolete design decisions. You can already review the code if you want but many of the flaws are still subject to change.~~ Should be good now.
On master there are separate functions for invoking CUDA kernels. Apart from invoking the actual CUDA kernels they do other things such as copying data between host and device. This PR adds a template ggml_cuda_op that manages
- the transfer of data between host and device,
- the dequantization of src0 (needed for cuBLAS matrix multiplication),
- the broadcasting of src1 across src0 (needed for multiplication),
- and multi GPU things.
The actual operations now only need to define how the data should be manipulated.
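As an illustration of that division of labor, here is a minimal sketch (illustrative names and a deliberately reduced parameter list, not the actual ggml-cuda.cu signatures): ggml_cuda_op handles the data movement, and the per-op code only launches a kernel on the device pointers it is handed.

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

// Per-op callback (sketch). By the time it runs, ggml_cuda_op has already copied the
// data to the device, dequantized src0 and broadcast src1 as needed, so the callback
// only sees plain device pointers.
typedef void (*ggml_cuda_op_sketch_t)(const float * src0_ddf, const float * src1_ddf,
                                      float * dst_ddf, int64_t nelements, cudaStream_t stream);

// Example device kernel: element-wise multiplication.
static __global__ void mul_f32_sketch(const float * x, const float * y, float * dst, const int64_t n) {
    const int64_t i = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = x[i] * y[i];
    }
}

// The op implementation only decides how the data is manipulated: no memcpys,
// no dequantization, no device selection.
static void ggml_cuda_op_mul_sketch(const float * src0_ddf, const float * src1_ddf,
                                    float * dst_ddf, int64_t nelements, cudaStream_t stream) {
    const int block_size = 256;
    const int num_blocks = (int) ((nelements + block_size - 1) / block_size);
    mul_f32_sketch<<<num_blocks, block_size, 0, stream>>>(src0_ddf, src1_ddf, dst_ddf, nelements);
}
```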
This PR also moves the entry point for invoking CUDA kernels out of the ggml functions such as ggml_compute_forward_mul_mat_q_f32 and instead adds a function ggml_cuda_compute_forward that is called from ggml_compute_forward. For this to work I moved ggml_task_type and ggml_compute_params from ggml.c to ggml.h.
This PR adds an int for the layer, an int for the device id, and dedicated device data pointers to ggml_tensor. I need these for bookkeeping. I also changed the backends from GGML_BACKEND_CUDA and GGML_BACKEND_OPENCL to GGML_BACKEND_GPU (tensor data on 1 GPU) and GGML_BACKEND_GPU_SPLIT (tensor data split across all GPUs). Since I think that we don't want to support the simultaneous use of CUDA and OpenCL, it's simpler to just use the same backend types for both implementations and to differentiate via defines.
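Roughly speaking, the additions amount to the following sketch (the member names and the device-count constant are placeholders; only the two GPU backend values are the ones described above):

```cuda
#define MAX_GPUS_SKETCH 16  // placeholder; the actual constant in ggml has a different name/value

enum ggml_backend_sketch {
    GGML_BACKEND_CPU,
    GGML_BACKEND_GPU,       // tensor data on one GPU (CUDA or OpenCL, selected via defines)
    GGML_BACKEND_GPU_SPLIT, // tensor data split across all GPUs
};

// Bookkeeping fields of the kind added to ggml_tensor (names illustrative):
struct ggml_tensor_gpu_fields_sketch {
    int    layer;                         // which layer of the model the tensor belongs to
    int    device_id;                     // device holding the data for GGML_BACKEND_GPU
    void * data_device[MAX_GPUS_SKETCH];  // per-device pointers for GGML_BACKEND_GPU_SPLIT
};
```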
Thanks so much for adding multi GPU support, I was looking forward to it. You are the king, man. This is extremely useful for increasing my VRAM for 30B. I'm posting my stats with my 1080 Ti + 1080. Amazing stuff, it's noticeably faster.
Stats with this version:
./main -m '/media/w/PhoenixSSD/oobabooga/text-generation-webui/models/supercot30b-ggml/ggml-model-q5_1.bin' -n 128 --n-gpu-layers 39 --threads 6 --no-mmap -s 1685192470
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 596 (428c342)
main: seed = 1685192470
ggml_init_cublas: found 2 CUDA devices:
1. NVIDIA GeForce GTX 1080 Ti
2. NVIDIA GeForce GTX 1080
llama.cpp: loading model from /media/w/PhoenixSSD/oobabooga/text-generation-webui/models/supercot30b-ggml/ggml-model-q5_1.bin
llama_model_load_internal: format = ggjt v2 (pre #1508)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 23269.21 MB
llama_model_load_internal: mem required = 10646.42 MB (+ 3124.00 MB per state)
llama_model_load_internal: [cublas] offloading 39 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 14926 MB
....................................................................................................
llama_init_from_file: kv self size = 780.00 MB
system_info: n_threads = 6 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Threading;
using Microsoft.Build.Framework;
using Microsoft.Build.Utilities;
namespace MSBuild.ExtensionPack.Platform
{
/// <summary>
/// Adds support for a target to find the first non-empty environment variable, by name
llama_print_timings: load time = 14079.62 ms
llama_print_timings: sample time = 55.40 ms / 128 runs ( 0.43 ms per token)
llama_print_timings: prompt eval time = 722.20 ms / 2 tokens ( 361.10 ms per token)
llama_print_timings: eval time = 54367.11 ms / 127 runs ( 428.09 ms per token)
llama_print_timings: total time = 68522.85 ms
Compare with the old version:
./main -m '/media/w/PhoenixSSD/oobabooga/text-generation-webui/models/supercot30b-ggml/ggml-model-q5_1.bin' -n 128 --n-gpu-layers 17 --threads 6 --no-mmap -s 1685192470
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 583 (7e4ea5b)
main: seed = 1685192470
llama.cpp: loading model from /media/w/PhoenixSSD/oobabooga/text-generation-webui/models/supercot30b-ggml/ggml-model-q5_1.bin
llama_model_load_internal: format = ggjt v2 (pre #1508)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 23269.14 MB
llama_model_load_internal: mem required = 19066.59 MB (+ 3124.00 MB per state)
llama_model_load_internal: [cublas] offloading 17 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 6506 MB
....................................................................................................
llama_init_from_file: kv self size = 780.00 MB
system_info: n_threads = 6 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Threading;
using Microsoft.Build.Framework;
using Microsoft.Build.Utilities;
namespace MSBuild.ExtensionPack.Platform
{
/// <summary>
/// Adds support for a target to find the first non-empty environment variable, by name
llama_print_timings: load time = 11117.69 ms
llama_print_timings: sample time = 55.35 ms / 128 runs ( 0.43 ms per token)
llama_print_timings: prompt eval time = 896.98 ms / 2 tokens ( 448.49 ms per token)
llama_print_timings: eval time = 82108.70 ms / 127 runs ( 646.53 ms per token)
llama_print_timings: total time = 93302.20 ms
Would it be too much to ask for cross-platform (Nvidia + AMD) support? LMAO.
I will try to check later; I think I can access a machine with 2x 2080 Ti.
Performance numbers from my test machine with an i5-4570S, 16 GB of RAM @ 1600 MHz, and a GTX 1070 + a GTX 1050 ti:
| Model | GPU | ms/t | t/s |
|---|---|---|---|
| 7b q4_0 | GTX 1070 | 71.37 | 14.01 |
| 7b q4_0 | GTX 1070 + GTX 1050 ti | 68.66 | 14.56 |
| 13b q4_0 | GTX 1070 | 134.19 | 7.45 |
| 13b q4_0 | GTX 1070 + GTX 1050 ti | 128.13 | 7.80 |
| 33b q4_0 | GTX 1070 | Unusable | Unusable |
| 33b q4_0 | GTX 1070 + GTX 1050 ti | 575.12 | 1.74 |
Numbers for single GPU are obtained using the master branch.
Note: previously I was able to run 33b q4_0 with just the GTX 1070 on master; there may be something on master that has increased RAM usage since.
33b
You mean 30B? I can run 30B Q4_0 with my 8 GB card with only 20 layers offloaded.
You mean 30B? I can run 30B Q4_0 with my 8 GB card with only 20 layers offloaded.
"30B" seems to be a typo by Meta that has become dominant. In the paper they talk about a "33B" model so that is the term that I'm using.
I think ggml_cuda_mul and ggml_cuda_mul_mat can be removed from ggml-cuda.h now and made static.
I added a comment to explain the weird device to host memcpy for split tensors. Since I, as the person who wrote the code, won't be able to tell: are there other parts of the code that are unintuitive or difficult to understand?
I added a CLI argument that lets the user set the tensor split. On my system a less VRAM-efficient 3:1 split seems to do better than a 2:1 split because it is more compute-efficient:
| Model | GPU | ms/t | t/s |
|---|---|---|---|
| 7b q4_0 | GTX 1070 | 71.37 | 14.01 |
| 7b q4_0 | GTX 1070 + GTX 1050 ti, 2:1 split | 68.66 | 14.56 |
| 7b q4_0 | GTX 1070 + GTX 1050 ti, 3:1 split | 59.03 | 16.94 |
| 13b q4_0 | GTX 1070 | 134.19 | 7.45 |
| 13b q4_0 | GTX 1070 + GTX 1050 ti, 2:1 split | 128.13 | 7.80 |
| 13b q4_0 | GTX 1070 + GTX 1050 ti, 3:1 split | 109.14 | 9.15 |
| 33b q4_0 | GTX 1070 | Unusable | Unusable |
| 33b q4_0 | GTX 1070 + GTX 1050 ti, 2:1 split | 575.12 | 1.74 |
| 33b q4_0 | GTX 1070 + GTX 1050 ti, 3:1 split | 571.10 | 1.75 |
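For reference, the 3:1 split from the table corresponds to an invocation along these lines (model path and layer count are just placeholders):
./main -m models/13B/ggml-model-q4_0.bin -n 128 --n-gpu-layers 40 --tensor-split 3,1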
The performance increase on my test systems is relatively low (+70% t/s when going from 1x GTX TITAN X to 4x GTX TITAN X).
May I ask what RAM and CPU you used for this test system?
It's a server set up by my institute. The CPU is an Intel(R) Xeon(R) E5-2630 v4 @ 2.20 GHz. I don't know what kind of RAM they put in since I'm not aware of any way to query this without superuser privileges. The test is also complicated by the fact that other people were using the machine at the time (and still are).
Alright, unless I'm forgetting something this PR should now be ready to be merged from my end.
There seems to be an issue with f16 models.
I fixed f16. I should perhaps mention that this quantization type does not support multiple GPUs; I plan to work on better f16 support in the future (and see if that will allow you to use less VRAM) and will change it then.
I was using a very short prompt for testing. There is an issue with long prompts.
I'm getting the following error:
cuBLAS error 14 at ggml-cuda.cu:759
when running
./main -m ~/llms/ggml-vic13b-q5_1.bin -p "hello" -ngl 1
I followed your instructions for building the binaries. 4x A40 48 GB cards.
Can you quickly check whether the code produces correct results with --tensor-split 1,0,0,0?
I fixed the issue with prompt processing. f16 still seems to have a bug somewhere with multiple GPUs.
I fixed the f16 issues. As a side effect f16 t/s also went up by ~100% because until now it was always using the general f16-f32 matrix multiplication function rather than the dequantization + matrix-vector multiplication kernel that I implemented.
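To give an idea of what such a kernel does, here is a rough sketch with one thread per output row and illustrative names (the actual kernel in this PR is organized differently, e.g. with a block per row and a warp-level reduction):

```cuda
#include <cuda_fp16.h>
#include <cstddef>

// y = W * v for an f16 weight matrix W (nrows x ncols) and an f32 vector v.
// The "dequantization" step for f16 is simply the conversion to f32, fused into
// the matrix-vector product instead of materializing an f32 copy of W.
static __global__ void convert_mul_mat_vec_f16_sketch(const half * W, const float * v,
                                                      float * y, const int ncols, const int nrows) {
    const int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nrows) {
        return;
    }
    float sum = 0.0f;
    for (int k = 0; k < ncols; ++k) {
        sum += __half2float(W[(size_t) row * ncols + k]) * v[k];
    }
    y[row] = sum;
}
```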
@huichen Can you do another test? The problem may have been caused by me only using one cuBLAS handle instead of one per GPU.
@huichen Can you do another test? The problem may have been caused by me only using one cuBLAS handle instead of one per GPU.
It works now, both with and without --tensor-split. Following are the eval times for the different --tensor-split values I experimented with; numbers are averaged over three runs each (all layers offloaded to GPU).
| --tensor-split | prompt eval ms/t | eval ms/t |
|---|---|---|
| 1,0,0,0 | 9.17 | 65.80 |
| 1,1,0,0 | 8.84 | 49.90 |
| 1,1,1,0 | 9.08 | 48.18 |
| 1,1,1,1 | 9.29 | 43.85 |
Cheers!
@JohannesGaessler Did you see that project? https://github.com/turboderp/exllama/tree/master/exllama_ext/cuda_func Looks like a ton of kernels under the MIT license, including full matmul for half and quantized variants, RoPE, norm, etc. Not sure where it comes from.
Yes, as I've said a dozen times before: I have seen the exllama repository. But it's not as simple as copy-pasting code from one project to another. I already know what to implement, the problem is just doing it in a way that actually works. Most of my time doing development is spent hunting down ggml-specific bugs and the exllama repository does not help here. Also to get good performance you have to e.g. consider the memory layout of ggml tensors and adjust your implementation accordingly. So using exllama code 1:1 in ggml probably won't work well.
Getting good matmul performance is amazingly hard. CLBlast is like 2x slower than cuBLAS/rocBLAS, for example. But these are libraries for general use; for llama.cpp it may be possible to create something custom but still very fast.
Just a quick update - sorry for the delayed review. Really focused on the Metal branch. Hope to finish it in a few days and then will come back to more active reviews.
There are a few other important PRs that are also pending and I need to get familiar with them before merging
Don't worry, I'm patient. Currently I'm working on GPU acceleration for the remaining tensors. If I get a working version before this PR gets merged, should I just keep pushing to this PR or save it for a new one?
should I just keep pushing to this PR or save it for a new one?
The current PR is a good addition on its own. Probably better to have the new stuff in a separate PR
It compiles and finds my CUDA devices, but is it using them?
ggml_init_cublas: found 2 CUDA devices:
1. Tesla P40
2. Tesla P40
llama.cpp: loading model from ./models/65B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 22016
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 0.18 MB
llama_model_load_internal: mem required = 38610.46 MB (+ 5120.00 MB per state)
llama_model_load_internal: [cublas] offloading 0 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 0 MB
.
llama_init_from_file: kv self size = 1280.00 MB
system_info: n_threads = 24 / 48 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
nvidia-smi does show main bound to each card:
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 5791 C ./main 1300MiB |
| 1 N/A N/A 5791 C ./main 1300MiB |
+---------------------------------------------------------------------------------------+
while llama is responding, I don't see any utilization:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla P40 On | 00000000:05:00.0 Off | Off |
| N/A 32C P0 50W / 250W| 1310MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla P40 On | 00000000:42:00.0 Off | Off |
| N/A 34C P0 51W / 250W| 1310MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
So I think it is actually not using my GPUs and is running in CPU mode still.
You need to set the --n-gpu-layers CLI argument to utilize the GPUs.
You need to set the --n-gpu-layers CLI argument to utilize the GPUs.
Is there a way to know how many layers the model needs at a certain bit depth? I think 65B needs 80, based on what I saw in text-generation-webui (turns out it is 80)
Speed is about 2x what I was seeing in text-generation-webui. Very nice.
Is there a way to know how many layers the model needs at a certain bit depth?
If you know they'll all fit, you should be able to set it to an absurdly high number like 10000. From what I've seen people say, it's not an error to set it higher than the number of layers in the model. Of course, if your GPU doesn't have enough VRAM it's going to die once it tries to load the data.
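For example (model path is a placeholder):
./main -m ./models/65B/ggml-model-q4_0.bin --n-gpu-layers 10000 -i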