llama.cpp
CUDA refactor, multi GPU support
This PR is quite large. Its primary goal is to lay the groundwork for the implementation of further CUDA kernels for ggml operations. I am also adding multi GPU support because it's easier to integrate now than it would be at a later point.
For Users
Build instructions (Linux):
git clone https://github.com/JohannesGaessler/llama.cpp llama.cpp-johannesgaessler
cd llama.cpp-johannesgaessler
git fetch
git switch cuda-refactor-8
make LLAMA_CUBLAS=1
When compiling with LLAMA_CUBLAS=1 the program automatically detects the available NVIDIA devices and splits the weights proportionally to their VRAM. There is not yet a CLI argument for setting the tensor split. The performance increase on my test systems is relatively low (+70% t/s when going from 1x GTX TITAN X to 4x GTX TITAN X). It's possible that there is still a bug that hampers performance. Please do tell me how well (if at all) it works for you. In any case, this PR should already allow you to pool the VRAM of multiple GPUs to load larger models.
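For example, after building, a run with some layers offloaded could look like this (model path and layer count are just placeholders; --n-gpu-layers works the same way as on master):
./main -m models/7B/ggml-model-q4_0.bin -p "Hello" -n 128 --n-gpu-layers 32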
For Developers
~~This PR is still very much WIP. I will do a refactor to remove artifacts from bad/obsolete design decisions. You can already review the code if you want but many of the flaws are still subject to change.~~ Should be good now.
On master there are separate functions for invoking CUDA kernels. Apart from invoking the actual CUDA kernels they do other things such as copying data between host and device. This PR adds a template ggml_cuda_op that manages
- the transfer of data between host and device,
- the dequantization of src0 (needed for cuBLAS matrix multiplication),
- the broadcasting of src1 across src0 (needed for multiplication),
- and multi GPU things.
The actual operations now only need to define how the data should be manipulated.
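As an illustration of that division of labor, here is a minimal sketch (illustrative names and a deliberately reduced parameter list, not the actual ggml-cuda.cu signatures): ggml_cuda_op handles the data movement, and the per-op code only launches a kernel on the device pointers it is handed.

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

// Per-op callback (sketch). By the time it runs, ggml_cuda_op has already copied the
// data to the device, dequantized src0 and broadcast src1 as needed, so the callback
// only sees plain device pointers.
typedef void (*ggml_cuda_op_sketch_t)(const float * src0_ddf, const float * src1_ddf,
                                      float * dst_ddf, int64_t nelements, cudaStream_t stream);

// Example device kernel: element-wise multiplication.
static __global__ void mul_f32_sketch(const float * x, const float * y, float * dst, const int64_t n) {
    const int64_t i = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = x[i] * y[i];
    }
}

// The op implementation only decides how the data is manipulated: no memcpys,
// no dequantization, no device selection.
static void ggml_cuda_op_mul_sketch(const float * src0_ddf, const float * src1_ddf,
                                    float * dst_ddf, int64_t nelements, cudaStream_t stream) {
    const int block_size = 256;
    const int num_blocks = (int) ((nelements + block_size - 1) / block_size);
    mul_f32_sketch<<<num_blocks, block_size, 0, stream>>>(src0_ddf, src1_ddf, dst_ddf, nelements);
}
```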
This PR also moves the entry point for invoking CUDA kernels out of the ggml functions such as ggml_compute_forward_mul_mat_q_f32 and instead adds a function ggml_cuda_compute_forward that is called from ggml_compute_forward. For this to work I moved ggml_task_type and ggml_compute_params from ggml.c to ggml.h.
This PR adds an int for the layer, an int for the device id, and dedicated device data pointers to ggml_tensor. I need these for bookkeeping. I also changed the backends from GGML_BACKEND_CUDA and GGML_BACKEND_OPENCL to GGML_BACKEND_GPU (tensor data on 1 GPU) and GGML_BACKEND_GPU_SPLIT (tensor data split across all GPUs). Since I think that we don't want to support the simultaneous use of CUDA and OpenCL, it's simpler to just use the same backend types for both implementations and to differentiate via defines.
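Roughly speaking, the additions amount to the following sketch (the member names and the device-count constant are placeholders; only the two GPU backend values are the ones described above):

```cuda
#define MAX_GPUS_SKETCH 16  // placeholder; the actual constant in ggml has a different name/value

enum ggml_backend_sketch {
    GGML_BACKEND_CPU,
    GGML_BACKEND_GPU,       // tensor data on one GPU (CUDA or OpenCL, selected via defines)
    GGML_BACKEND_GPU_SPLIT, // tensor data split across all GPUs
};

// Bookkeeping fields of the kind added to ggml_tensor (names illustrative):
struct ggml_tensor_gpu_fields_sketch {
    int    layer;                         // which layer of the model the tensor belongs to
    int    device_id;                     // device holding the data for GGML_BACKEND_GPU
    void * data_device[MAX_GPUS_SKETCH];  // per-device pointers for GGML_BACKEND_GPU_SPLIT
};
```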
Thanks so much for adding multi GPU support, I was looking forward to it. You are the king, man. This is extremely useful for increasing my VRAM for 30B. I'm posting my stats with my 1080 Ti + 1080. Amazing stuff, it's noticeably faster.
Stats with this version:
./main -m '/media/w/PhoenixSSD/oobabooga/text-generation-webui/models/supercot30b-ggml/ggml-model-q5_1.bin' -n 128 --n-gpu-layers 39 --threads 6 --no-mmap -s 1685192470
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 596 (428c342)
main: seed = 1685192470
ggml_init_cublas: found 2 CUDA devices:
1. NVIDIA GeForce GTX 1080 Ti
2. NVIDIA GeForce GTX 1080
llama.cpp: loading model from /media/w/PhoenixSSD/oobabooga/text-generation-webui/models/supercot30b-ggml/ggml-model-q5_1.bin
llama_model_load_internal: format = ggjt v2 (pre #1508)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 23269.21 MB
llama_model_load_internal: mem required = 10646.42 MB (+ 3124.00 MB per state)
llama_model_load_internal: [cublas] offloading 39 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 14926 MB
....................................................................................................
llama_init_from_file: kv self size = 780.00 MB
system_info: n_threads = 6 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Threading;
using Microsoft.Build.Framework;
using Microsoft.Build.Utilities;
namespace MSBuild.ExtensionPack.Platform
{
/// <summary>
/// Adds support for a target to find the first non-empty environment variable, by name
llama_print_timings: load time = 14079.62 ms
llama_print_timings: sample time = 55.40 ms / 128 runs ( 0.43 ms per token)
llama_print_timings: prompt eval time = 722.20 ms / 2 tokens ( 361.10 ms per token)
llama_print_timings: eval time = 54367.11 ms / 127 runs ( 428.09 ms per token)
llama_print_timings: total time = 68522.85 ms
Compare with the old version:
./main -m '/media/w/PhoenixSSD/oobabooga/text-generation-webui/models/supercot30b-ggml/ggml-model-q5_1.bin' -n 128 --n-gpu-layers 17 --threads 6 --no-mmap -s 1685192470
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 583 (7e4ea5b)
main: seed = 1685192470
llama.cpp: loading model from /media/w/PhoenixSSD/oobabooga/text-generation-webui/models/supercot30b-ggml/ggml-model-q5_1.bin
llama_model_load_internal: format = ggjt v2 (pre #1508)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 23269.14 MB
llama_model_load_internal: mem required = 19066.59 MB (+ 3124.00 MB per state)
llama_model_load_internal: [cublas] offloading 17 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 6506 MB
....................................................................................................
llama_init_from_file: kv self size = 780.00 MB
system_info: n_threads = 6 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Threading;
using Microsoft.Build.Framework;
using Microsoft.Build.Utilities;
namespace MSBuild.ExtensionPack.Platform
{
/// <summary>
/// Adds support for a target to find the first non-empty environment variable, by name
llama_print_timings: load time = 11117.69 ms
llama_print_timings: sample time = 55.35 ms / 128 runs ( 0.43 ms per token)
llama_print_timings: prompt eval time = 896.98 ms / 2 tokens ( 448.49 ms per token)
llama_print_timings: eval time = 82108.70 ms / 127 runs ( 646.53 ms per token)
llama_print_timings: total time = 93302.20 ms
Would it be too much to ask for cross-platform (Nvidia + AMD) support? LMAO.
I will try to check later; I think I can access a machine with 2x 2080 Ti.
Performance numbers from my test machine with an i5-4570S, 16 GB of RAM @ 1600 MHz, and a GTX 1070 + a GTX 1050 ti:
| Model | GPU | ms/t | t/s |
|---|---|---|---|
| 7b q4_0 | GTX 1070 | 71.37 | 14.01 |
| 7b q4_0 | GTX 1070 + GTX 1050 ti | 68.66 | 14.56 |
| 13b q4_0 | GTX 1070 | 134.19 | 7.45 |
| 13b q4_0 | GTX 1070 + GTX 1050 ti | 128.13 | 7.80 |
| 33b q4_0 | GTX 1070 | Unusable | Unusable |
| 33b q4_0 | GTX 1070 + GTX 1050 ti | 575.12 | 1.74 |
Numbers for single GPU are obtained using the master branch.
Note: previously I was able to run 33b q4_0 with just the GTX 1070 on master; there may be something on master that has increased RAM usage since.
33b
You mean 30B? I can run 30B Q4_0 with my 8 GB card with only 20 layers offloaded.
You mean 30B? I can run 30B Q4_0 with my 8 GB card with only 20 layers offloaded.
"30B" seems to be a typo by Meta that has become dominant. In the paper they talk about a "33B" model so that is the term that I'm using.
I think ggml_cuda_mul and ggml_cuda_mul_mat can be removed from ggml-cuda.h now and made static.
I added a comment to explain the weird device to host memcpy for split tensors. Since I, as the person who wrote the code, won't be able to tell: are there other parts of the code that are unintuitive or difficult to understand?
I added a CLI argument that lets the user set the tensor split. On my system a less VRAM-efficient 3:1 split seems to do better than a 2:1 split because it is more compute-efficient:
| Model | GPU | ms/t | t/s |
|---|---|---|---|
| 7b q4_0 | GTX 1070 | 71.37 | 14.01 |
| 7b q4_0 | GTX 1070 + GTX 1050 ti, 2:1 split | 68.66 | 14.56 |
| 7b q4_0 | GTX 1070 + GTX 1050 ti, 3:1 split | 59.03 | 16.94 |
| 13b q4_0 | GTX 1070 | 134.19 | 7.45 |
| 13b q4_0 | GTX 1070 + GTX 1050 ti, 2:1 split | 128.13 | 7.80 |
| 13b q4_0 | GTX 1070 + GTX 1050 ti, 3:1 split | 109.14 | 9.15 |
| 33b q4_0 | GTX 1070 | Unusable | Unusable |
| 33b q4_0 | GTX 1070 + GTX 1050 ti, 2:1 split | 575.12 | 1.74 |
| 33b q4_0 | GTX 1070 + GTX 1050 ti, 3:1 split | 571.10 | 1.75 |
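For reference, the 3:1 split from the table corresponds to an invocation along these lines (model path and layer count are just placeholders):
./main -m models/13B/ggml-model-q4_0.bin -n 128 --n-gpu-layers 40 --tensor-split 3,1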
The performance increase on my test systems is relatively low (+70% t/s when going from 1x GTX TITAN X to 4x GTX TITAN X).
May I ask what RAM and CPU you used for this test system?
It's a server set up by my institute. The CPU is an Intel(R) Xeon(R) E5-2630 v4 @ 2.20 GHz. I don't know what kind of RAM they put in since I'm not aware of any way to query this without superuser privileges. The test is also complicated by the fact that other people were using the machine at the time (and still are).
Alright, unless I'm forgetting something this PR should now be ready to be merged from my end.
There seems to be an issue with f16 models.
I fixed f16. I should perhaps mention that this quantization type does not support multiple GPUs; I plan to work on better f16 support in the future (and see if that will allow you to use less VRAM) and will change it then.
I was using a very short prompt for testing. There is an issue with long prompts.
I'm getting the following error:
cuBLAS error 14 at ggml-cuda.cu:759
when running
./main -m ~/llms/ggml-vic13b-q5_1.bin -p "hello" -ngl 1
I followed your instructions for building the binaries. 4x A40 48 GB cards.
Can you quickly check whether the code produces correct results with --tensor-split 1,0,0,0?
I fixed the issue with prompt processing. f16 still seems to have a bug somewhere with multiple GPUs.
I fixed the f16 issues. As a side effect f16 t/s also went up by ~100% because until now it was always using the general f16-f32 matrix multiplication function rather than the dequantization + matrix-vector multiplication kernel that I implemented.
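To give an idea of what such a kernel does, here is a rough sketch with one thread per output row and illustrative names (the actual kernel in this PR is organized differently, e.g. with a block per row and a warp-level reduction):

```cuda
#include <cuda_fp16.h>
#include <cstddef>

// y = W * v for an f16 weight matrix W (nrows x ncols) and an f32 vector v.
// The "dequantization" step for f16 is simply the conversion to f32, fused into
// the matrix-vector product instead of materializing an f32 copy of W.
static __global__ void convert_mul_mat_vec_f16_sketch(const half * W, const float * v,
                                                      float * y, const int ncols, const int nrows) {
    const int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nrows) {
        return;
    }
    float sum = 0.0f;
    for (int k = 0; k < ncols; ++k) {
        sum += __half2float(W[(size_t) row * ncols + k]) * v[k];
    }
    y[row] = sum;
}
```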
@huichen Can you do another test? The problem may have been caused by me only using one cuBLAS handle instead of one per GPU.
@huichen Can you do another test? The problem may have been caused by me only using one cuBLAS handle instead of one per GPU.
It works now, both with and without --tensor-split. Following are the eval times for the different --tensor-split values I experimented with; numbers are averaged over three runs each (all layers offloaded to GPU).
| --tensor-split | prompt eval ms/t | eval ms/t |
|---|---|---|
| 1,0,0,0 | 9.17 | 65.80 |
| 1,1,0,0 | 8.84 | 49.90 |
| 1,1,1,0 | 9.08 | 48.18 |
| 1,1,1,1 | 9.29 | 43.85 |
Cheers!
@JohannesGaessler Did you see that project? https://github.com/turboderp/exllama/tree/master/exllama_ext/cuda_func Looks like a ton of kernels under the MIT license, including full matmul for half and quantized variants, RoPE, norm, etc. Not sure where it comes from.
Yes, as I've said a dozen times before: I have seen the exllama repository. But it's not as simple as copy-pasting code from one project to another. I already know what to implement, the problem is just doing it in a way that actually works. Most of my time doing development is spent hunting down ggml-specific bugs and the exllama repository does not help here. Also to get good performance you have to e.g. consider the memory layout of ggml tensors and adjust your implementation accordingly. So using exllama code 1:1 in ggml probably won't work well.
Getting good matmul performance is amazingly hard. CLBlast is like 2x slower than cuBLAS/rocBLAS, for example. But these are libraries for general use; for llama.cpp it may be possible to create something custom but still very fast.
Just a quick update - sorry for the delayed review. Really focused on the Metal branch. Hope to finish it in a few days and then will come back to more active reviews.
There are a few other important PRs that are also pending and I need to get familiar with them before merging
Don't worry, I'm patient. Currently I'm working on GPU acceleration for the remaining tensors. If I get a working version before this PR gets merged, should I just keep pushing to this PR or save it for a new one?
should I just keep pushing to this PR or save it for a new one?
The current PR is a good addition on its own. Probably better to have the new stuff in a separate PR
It compiles and finds my CUDA devices, but is it using them?
ggml_init_cublas: found 2 CUDA devices:
1. Tesla P40
2. Tesla P40
llama.cpp: loading model from ./models/65B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 22016
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 0.18 MB
llama_model_load_internal: mem required = 38610.46 MB (+ 5120.00 MB per state)
llama_model_load_internal: [cublas] offloading 0 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 0 MB
.
llama_init_from_file: kv self size = 1280.00 MB
system_info: n_threads = 24 / 48 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
nvidia-smi does show main bound to each card:
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 5791 C ./main 1300MiB |
| 1 N/A N/A 5791 C ./main 1300MiB |
+---------------------------------------------------------------------------------------+
while llama is responding, I don't see any utilization:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla P40 On | 00000000:05:00.0 Off | Off |
| N/A 32C P0 50W / 250W| 1310MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla P40 On | 00000000:42:00.0 Off | Off |
| N/A 34C P0 51W / 250W| 1310MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
So I think it is actually not using my GPUs and is running in CPU mode still.
You need to set the --n-gpu-layers CLI argument to utilize the GPUs.
You need to set the --n-gpu-layers CLI argument to utilize the GPUs.
Is there a way to know how many layers the model needs at a certain bit depth? I think 65B needs 80, based on what I saw in text-generation-webui (turns out it is 80)
Speed is about 2x what I was seeing in text-generation-webui. Very nice.
Is there a way to know how many layers the model needs at a certain bit depth?
If you know they'll all fit, you should be able to set it to an absurdly high number like 10000. From what I've seen people say, it's not an error to set it higher than the number of layers in the model. Of course, if your GPU doesn't have enough VRAM it's going to die once it tries to load the data.
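For example (model path is a placeholder):
./main -m ./models/65B/ggml-model-q4_0.bin --n-gpu-layers 10000 -i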