llama.cpp
Improve cuBLAS performance by dequantizing on the GPU
For me this makes cuBLAS about twice as fast with quantized models.
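Rough idea of the change: copy the quantized blocks to the device and expand them to fp32 there, right before the cuBLAS GEMM, instead of dequantizing on the CPU and uploading full fp32 matrices. Below is a minimal sketch of such a kernel for the q4_0 layout (32 weights per block: one fp32 scale plus 16 bytes of packed nibbles); the struct and kernel names here are illustrative, not the exact code in this PR.

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

#define QK4_0 32

// q4_0 block: one fp32 scale followed by 32 4-bit quants packed two per byte
// (assumed layout, matching ggml's CPU-side struct at the time).
typedef struct {
    float   d;               // scale
    uint8_t qs[QK4_0 / 2];   // packed nibbles
} block_q4_0;

// Each thread expands one 32-weight block to fp32: w = (nibble - 8) * d.
__global__ void dequantize_block_q4_0(const block_q4_0 * x, float * y, int nb) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nb) {
        return;
    }

    const float d = x[i].d;
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const uint8_t q = x[i].qs[j];
        y[i*QK4_0 + 2*j + 0] = ((q & 0x0F) - 8) * d;
        y[i*QK4_0 + 2*j + 1] = ((q >>   4) - 8) * d;
    }
}
```

A launch like `dequantize_block_q4_0<<<(nb + 255) / 256, 256>>>(dev_q, dev_f32, nb)` right before the `cublasSgemm` call replaces the CPU-side dequantize-then-upload step, so only ~5 bits per weight travel over PCIe instead of 32-bit floats.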
Perplexity (seconds per pass):
Model | PR | Master |
---|---|---|
q4_0 | 5.05 | 8.62 |
q4_1 | 5.37 | 8.59 |
q4_2 | 4.99 | 10.76 |
Prompt eval time with 7B q4_0 (bs=512)
cuBLAS (PR): prompt eval time = 7840.48 ms / 631 tokens ( 12.43 ms per token)
cuBLAS (Master): prompt eval time = 15457.33 ms / 631 tokens ( 24.50 ms per token)
OpenBLAS: prompt eval time = 34856.06 ms / 631 tokens ( 55.24 ms per token)
No BLAS: prompt eval time = 43549.67 ms / 631 tokens ( 69.02 ms per token)
13B q4_0
cuBLAS (PR): prompt eval time = 13826.48 ms / 631 tokens ( 21.91 ms per token)
cuBLAS (Master): prompt eval time = 27987.82 ms / 631 tokens ( 44.35 ms per token)
OpenBLAS: prompt eval time = 61476.58 ms / 631 tokens ( 97.43 ms per token)
No BLAS: prompt eval time = 81645.43 ms / 631 tokens ( 129.39 ms per token)
I just added some prompt eval times for 7B q4_0.
Wow, this is a game changer! Interestingly, 16 threads and 8 threads seem to be the same speed now. It only uses ~600MB of GPU RAM (RTX 3080), and GPU utilization is around 65%. Amazing work!
All tests run with: $ ./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -c 512 -t <N>
w/ cuBLAS (this PR):
1 thread - 5.12 seconds per pass - ETA 0.93 hours
2 threads - 3.82 seconds per pass - ETA 0.70 hours
4 threads - 3.22 seconds per pass - ETA 0.59 hours
8 threads - 2.93 seconds per pass - ETA 0.53 hours
16 threads - 2.82 seconds per pass - ETA 0.51 hours
Without CUDA:
8 threads - 16.57 seconds per pass - ETA 3.01 hours
16 threads - 11.46 seconds per pass - ETA 2.08 hours
Even more incredible, this allows me to run the full 65B model on a machine with 32GB of RAM quite quickly!
7B - 2.93 seconds per pass - ETA 0.53 hours
13B - 4.86 seconds per pass - ETA 0.88 hours
30B - 10.99 seconds per pass - ETA 2.00 hours
65B - 37.98 seconds per pass - ETA 6.91 hours
Btw, for comparison, from last month 7B was at 24.58 seconds per pass - ETA 4.47 hours!
Nice! Don't use this just yet to run perplexity computations, though; I found a synchronization issue that may cause inaccurate results. It should be fixed in the last commit. I am running a full perplexity test, and if it looks good this will be ready to merge.
Even more incredible, this allows me to run the full 65B model on a machine with 32GB of RAM quite quickly!
wait... how does that work? Aren't you supposed to need ~60 GB of RAM for 65B?
I'm using mmap mode, so it has to go to disk to read parts of the model in as it's going. That was brutally slow previously, but the overlap with running things on GPU seems to make it feasible now.
cuBLAS - 37.98 seconds per pass - ETA 6.91 hours
CPU - 109.73 seconds per pass - ETA 19.96 hours
Actually, even on CPU it's much better than it used to be. Everyone is doing amazing work here :).
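To unpack the mmap point above: the weights file is mapped read-only into the process, so the kernel only reads pages from disk when a tensor is actually touched and can evict them again under memory pressure; that is how a model file larger than physical RAM can still run, at the cost of extra disk reads. A bare-bones POSIX sketch of that mechanism (not llama.cpp's actual loader):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

// Map the model file read-only. Pages are only read from disk when the
// corresponding tensor data is first touched, and the OS is free to drop
// them again under memory pressure, so resident RAM stays bounded.
void * map_model(const char * path, size_t * size_out) {
    const int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return nullptr;
    }

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return nullptr;
    }

    void * addr = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); // the mapping remains valid after the descriptor is closed

    if (addr == MAP_FAILED) {
        return nullptr;
    }
    *size_out = (size_t) st.st_size;
    return addr; // tensor data is paged in lazily as it is dereferenced
}
```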
We could probably get another 10% or so speedup by pre-allocating the CUDA memory, but I am not sure how to do that without littering the ggml code with more CUDA-specific stuff.
On a side note, should we increase the default batch size when ggml is built with BLAS support? It would make it easier to use.
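On the pre-allocation idea: one way to keep it out of the ggml core would be a small buffer cache that lives entirely inside the CUDA backend, handing out scratch space and only reallocating when a request grows. A hypothetical sketch (names invented, not from this PR):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical: a tiny cache of device scratch buffers kept inside the CUDA
// backend, so the rest of ggml never has to call cudaMalloc/cudaFree itself.
// Buffers are grabbed once and only grow; error checking omitted for brevity.
#define GGML_CUDA_SCRATCH_SLOTS 4

static void * g_scratch     [GGML_CUDA_SCRATCH_SLOTS] = { nullptr };
static size_t g_scratch_size[GGML_CUDA_SCRATCH_SLOTS] = { 0 };

static void * ggml_cuda_scratch_get(int slot, size_t size) {
    if (g_scratch_size[slot] < size) {
        if (g_scratch[slot] != nullptr) {
            cudaFree(g_scratch[slot]);
        }
        cudaMalloc(&g_scratch[slot], size);
        g_scratch_size[slot] = size;
    }
    return g_scratch[slot]; // reused across mul_mat calls, freed at shutdown
}
```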
A problem while building for Windows using Visual Studio:
FAILED: CMakeFiles/ggml.dir/ggml-cuda.cu.obj
nvcc.exe -forward-unknown-to-host-compiler -DGGML_USE_CUBLAS -D_CRT_SECURE_NO_WARNINGS -I..\..\..\. -isystem="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3\include" -D_WINDOWS -Xcompiler=" /GR /EHsc" -Xcompiler="-MD -Zi -O2 -Ob1" -DNDEBUG /arch:AVX2 -MD -MT CMakeFiles\ggml.dir\ggml-cuda.cu.obj -MF CMakeFiles\ggml.dir\ggml-cuda.cu.obj.d -x cu -c ..\..\..\ggml-cuda.cu -o CMakeFiles\ggml.dir\ggml-cuda.cu.obj -Xcompiler=-FdCMakeFiles\ggml.dir\,-FS
nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified
On a side note, should we increase the default batch size when ggml is built with BLAS support? It would make it easier to use.
I believe ggml doesn't even use BLAS if the batch size isn't large enough despite system_info reporting BLAS=1. You need a larger batch size to cover the overhead of using the library. Personally I haven't seen any performance difference between BLAS runs with say a batch size of 512 vs 2048.
I believe ggml doesn't even use BLAS if the batch size isn't large enough despite system_info reporting BLAS=1
That's right, the default batch size is 8, but the minimum to use BLAS is 32.
Personally I haven't seen any performance difference between BLAS runs with say a batch size of 512 vs 2048.
Currently the maximum batch size is 512, if you try to use a larger one it will be clamped to 512.
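In code terms, that gate is just a shape check before the mat-mul is dispatched to BLAS; a simplified sketch of the idea (the real check in ggml also looks at things like tensor contiguity and types):

```cpp
// Simplified version of the gate: only hand the matrix multiplication to
// BLAS/cuBLAS when the batch (number of prompt tokens processed at once) is
// large enough to amortize the dispatch and transfer overhead; otherwise the
// regular CPU path is cheaper. "32" mirrors the minimum mentioned above.
static bool use_blas_for_mul_mat(int n_rows, int n_cols, int n_batch) {
    return n_rows >= 32 && n_cols >= 32 && n_batch >= 32;
}
```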
@avada-z I think it should be fixed now.
I ported this to HIP and hipBLAS.
The dequant on the device is nothing earth-shattering. Still dominated by the memcpy. But I have an older card and only PCIe 3.0.
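For reference, HIP tracks the CUDA runtime API closely, so a port like that can mostly be a thin rename layer plus hipBLAS for the GEMMs. A hedged sketch of the shim style involved (the guard macro and the exact set of defines are illustrative, not the actual port):

```cpp
// Illustrative shim: with hipcc, most of the CUDA runtime and cuBLAS calls in
// ggml-cuda.cu can be redirected to HIP/hipBLAS with defines like these (a
// real port needs a longer list, plus build-system changes).
#if defined(GGML_USE_HIPBLAS) // hypothetical guard macro
#include <hip/hip_runtime.h>
#include <hipblas/hipblas.h>
#define cudaMalloc              hipMalloc
#define cudaFree                hipFree
#define cudaMemcpyAsync         hipMemcpyAsync
#define cudaMemcpyHostToDevice  hipMemcpyHostToDevice
#define cudaStreamCreate        hipStreamCreate
#define cudaStreamSynchronize   hipStreamSynchronize
#define cublasHandle_t          hipblasHandle_t
#define cublasCreate            hipblasCreate
#define cublasSetStream         hipblasSetStream
#define cublasSgemm             hipblasSgemm
#define CUBLAS_OP_N             HIPBLAS_OP_N
#define CUBLAS_OP_T             HIPBLAS_OP_T
#else
#include <cuda_runtime.h>
#include <cublas_v2.h>
#endif
```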
Hello there. I'm trying to build it using the make LLAMA_CUBLAS=1 command on Windows with WIN64DevKit. However, even though I have the CUDA Toolkit installed and changed the paths for -L and -I in the Makefile accordingly, it still can't find the following libraries: -lcublas_static, -lculibos, -lcublasLt_static, and -ldl.
Where can I get them? I would appreciate some help getting this to work. Thank you!
Still dominated by the memcpy
If the weights are stored in the device HBM/DRAM, I suspect we can get much better perf than copying the weights each time.
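A sketch of what that could look like: keep a per-tensor device copy keyed on the tensor's host pointer, upload the quantized data once, and skip the host-to-device memcpy on later calls (hypothetical helper, not from this PR):

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <unordered_map>

// Hypothetical cache: keep each quantized weight tensor resident in device
// memory, keyed by its host pointer, so the host-to-device copy happens once
// instead of on every mat-mul. Error checking omitted for brevity.
static std::unordered_map<const void *, void *> g_device_weights;

static void * get_device_weights(const void * host_data, size_t nbytes,
                                 cudaStream_t stream) {
    const auto it = g_device_weights.find(host_data);
    if (it != g_device_weights.end()) {
        return it->second; // already resident in HBM/DRAM
    }
    void * dev = nullptr;
    cudaMalloc(&dev, nbytes);
    cudaMemcpyAsync(dev, host_data, nbytes, cudaMemcpyHostToDevice, stream);
    g_device_weights[host_data] = dev;
    return dev;
}
```

Of course, only whatever subset of tensors fits in VRAM can be kept resident, so a real version would have to pick which weights to pin.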