llama.cpp

Improve cuBLAS performance by dequantizing on the GPU

slaren opened this issue

For me this makes cuBLAS about twice as fast with quantized models.

Perplexity, in seconds per pass:

Model   PR     Master
q4_0    5.05    8.62
q4_1    5.37    8.59
q4_2    4.99   10.76

Prompt eval time with 7B q4_0 (bs=512)

cuBLAS (PR):     prompt eval time =  7840.48 ms /   631 tokens (   12.43 ms per token)
cuBLAS (Master): prompt eval time = 15457.33 ms /   631 tokens (   24.50 ms per token)
OpenBLAS:        prompt eval time = 34856.06 ms /   631 tokens (   55.24 ms per token)
No BLAS:         prompt eval time = 43549.67 ms /   631 tokens (   69.02 ms per token)

13B q4_0

cuBLAS (PR):     prompt eval time = 13826.48 ms /   631 tokens (   21.91 ms per token)
cuBLAS (Master): prompt eval time = 27987.82 ms /   631 tokens (   44.35 ms per token)
OpenBLAS:        prompt eval time = 61476.58 ms /   631 tokens (   97.43 ms per token)
No BLAS:         prompt eval time = 81645.43 ms /   631 tokens (  129.39 ms per token)
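
For context, the core change is to copy the compact quantized blocks to the GPU and dequantize them there, instead of dequantizing on the CPU and pushing full fp32 matrices over PCIe before the cuBLAS SGEMM. A minimal sketch of that idea (not the PR's actual code; it assumes the q4_0 layout of the time, 32 4-bit quants sharing one fp32 scale):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdint.h>

#define QK 32

typedef struct {
    float   d;          // per-block scale
    uint8_t qs[QK / 2]; // 32 quants, two 4-bit values per byte
} block_q4_0;

// Expand quantized blocks into fp32 on the device; only the small
// quantized data has to cross PCIe.
__global__ void dequantize_block_q4_0(const block_q4_0 * x, float * y, int nb) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nb) return;

    const float d = x[i].d;
    for (int j = 0; j < QK / 2; ++j) {
        const uint8_t v = x[i].qs[j];
        // each nibble is an unsigned 4-bit value stored with an offset of 8
        y[i*QK + 2*j + 0] = ((int)(v & 0x0F) - 8) * d;
        y[i*QK + 2*j + 1] = ((int)(v >>   4) - 8) * d;
    }
}

// Host side (setup and error handling omitted):
//   cudaMemcpyAsync(d_q, h_q, nb*sizeof(block_q4_0), cudaMemcpyHostToDevice, stream);
//   dequantize_block_q4_0<<<(nb + 255)/256, 256, 0, stream>>>(d_q, d_f, nb);
//   cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, ...);  // multiply the fp32 buffer as before
```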

slaren, Apr 19 '23

I just added some prompt eval times for 7B q4_0.

slaren, Apr 19 '23

Wow, this is a game changer! Interestingly, 16 threads and 8 threads now seem to run at about the same speed. It only uses ~600MB of GPU RAM (RTX 3080), with GPU utilization around 65%. Amazing work!

All tests run with: $ ./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -c 512 -t <N>

w/ cuBLAS (this PR):

1 thread  - 5.12 seconds per pass - ETA 0.93 hours
2 threads - 3.82 seconds per pass - ETA 0.70 hours
4 threads - 3.22 seconds per pass - ETA 0.59 hours
8 threads - 2.93 seconds per pass - ETA 0.53 hours
16 threads - 2.82 seconds per pass - ETA 0.51 hours

Without CUDA:

8 threads  - 16.57 seconds per pass - ETA 3.01 hours
16 threads - 11.46 seconds per pass - ETA 2.08 hours

glinscott, Apr 19 '23

Even more incredible, this allows me to run the full 65B model on a machine with 32GB of RAM quite quickly!

7B - 2.93 seconds per pass - ETA 0.53 hours
13B - 4.86 seconds per pass - ETA 0.88 hours
30B - 10.99 seconds per pass - ETA 2.00 hours
65B - 37.98 seconds per pass - ETA 6.91 hours

Btw, for comparison, from last month 7B was at 24.58 seconds per pass - ETA 4.47 hours!

glinscott, Apr 19 '23

Nice! Don't use this to run perplexity computations just yet, though: I found a synchronization issue that may cause inaccurate results. It should be fixed in the last commit; I am running a full perplexity test, and if it looks good the PR will be ready to merge.

slaren, Apr 19 '23

Even more incredible, this allows me to run the full 65B model on a machine with 32GB of RAM quite quickly!

Wait... how does that work? Aren't you supposed to need ~60 GB of RAM for 65B?

Green-Sky, Apr 19 '23

Even more incredible, this allows me to run the full 65B model on a machine with 32GB of RAM quite quickly!

Wait... how does that work? Aren't you supposed to need ~60 GB of RAM for 65B?

I'm using mmap mode, so parts of the model are read back in from disk as the evaluation goes. That was brutally slow previously, but the overlap with the work running on the GPU seems to make it feasible now.

cuBLAS - 37.98 seconds per pass - ETA 6.91 hours
CPU - 109.73 seconds per pass - ETA 19.96 hours

Actually, even on CPU it's much better than it used to be. Everyone is doing amazing work here :).
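
For readers wondering how a 65B model fits in 32GB at all: with mmap, the OS pages weight data in from disk on demand and can evict it again, so the resident set stays below the full model size. A minimal sketch of the mechanism (not llama.cpp's actual loader; the helper name is illustrative):

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map the model file read-only: pages are faulted in from disk only as
// tensors are touched, and the kernel may drop them under memory pressure
// instead of keeping the whole model resident.
static void * map_model(const char * path, size_t * size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }

    void * data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); // the mapping remains valid after closing the descriptor

    if (data == MAP_FAILED) return NULL;

    *size_out = (size_t) st.st_size;
    return data;
}
```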

glinscott, Apr 19 '23

We could probably get another 10% or so speedup by pre-allocating the CUDA memory, but I am not sure how to do that without littering the ggml code with more CUDA-specific stuff.
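
One possible shape for that (purely illustrative, not this PR's code): keep a small cache of device buffers and hand them out again instead of calling cudaMalloc/cudaFree around every multiplication.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

#define POOL_SLOTS 16

typedef struct {
    void * ptr;
    size_t size;
} pool_entry;

static pool_entry g_pool[POOL_SLOTS]; // zero-initialized

// Hand out a cached buffer that is large enough; only fall back to
// cudaMalloc when nothing suitable is available.
static void * pool_malloc(size_t size, size_t * actual_size) {
    for (int i = 0; i < POOL_SLOTS; ++i) {
        if (g_pool[i].ptr != NULL && g_pool[i].size >= size) {
            void * p = g_pool[i].ptr;
            *actual_size = g_pool[i].size;
            g_pool[i].ptr = NULL;
            return p;
        }
    }
    void * p = NULL;
    cudaMalloc(&p, size);
    *actual_size = size;
    return p;
}

// Return the buffer to the pool instead of freeing it.
static void pool_free(void * ptr, size_t size) {
    for (int i = 0; i < POOL_SLOTS; ++i) {
        if (g_pool[i].ptr == NULL) {
            g_pool[i].ptr  = ptr;
            g_pool[i].size = size;
            return;
        }
    }
    cudaFree(ptr); // pool is full
}
```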

slaren, Apr 19 '23

On a side note, should we increase the default batch size when ggml is built with BLAS support? It would make it easier to use.

slaren, Apr 19 '23

A problem while building for Windows using Visual Studio:

FAILED: CMakeFiles/ggml.dir/ggml-cuda.cu.obj
nvcc.exe -forward-unknown-to-host-compiler -DGGML_USE_CUBLAS -D_CRT_SECURE_NO_WARNINGS -I..\..\..\. -isystem="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3\include" -D_WINDOWS -Xcompiler=" /GR /EHsc" -Xcompiler="-MD -Zi -O2 -Ob1" -DNDEBUG /arch:AVX2 -MD -MT CMakeFiles\ggml.dir\ggml-cuda.cu.obj -MF CMakeFiles\ggml.dir\ggml-cuda.cu.obj.d -x cu -c ..\..\..\ggml-cuda.cu -o CMakeFiles\ggml.dir\ggml-cuda.cu.obj -Xcompiler=-FdCMakeFiles\ggml.dir\,-FS
nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified

avada-z, Apr 19 '23

On a side note, should we increase the default batch size when ggml is built with BLAS support? It would make it easier to use.

I believe ggml doesn't even use BLAS if the batch size isn't large enough despite system_info reporting BLAS=1. You need a larger batch size to cover the overhead of using the library. Personally I haven't seen any performance difference between BLAS runs with say a batch size of 512 vs 2048.

ghost, Apr 19 '23

I believe ggml doesn't even use BLAS if the batch size isn't large enough despite system_info reporting BLAS=1

That's right: the default batch size is 8, but the minimum batch size to use BLAS is 32.

Personally I haven't seen any performance difference between BLAS runs with say a batch size of 512 vs 2048.

Currently the maximum batch size is 512; if you try to use a larger one, it will be clamped to 512.
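
To make the dispatch concrete, the check is roughly of this shape (a paraphrase, not the exact ggml source): every dimension of the multiplication has to be large enough to amortize the BLAS setup and copy overhead.

```c
#include <stdbool.h>
#include <stdint.h>

// Small batches stay on the custom ggml kernels; larger ones
// (all dimensions >= 32) are routed through BLAS / cuBLAS.
static bool mul_mat_use_blas(int64_t dst_rows, int64_t dst_cols, int64_t inner_dim) {
    return dst_rows >= 32 && dst_cols >= 32 && inner_dim >= 32;
}
```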

slaren, Apr 19 '23

@avada-z I think it should be fixed now.

slaren, Apr 19 '23

I ported this to HIP and hipBLAS.

The dequant on the device is nothing earth-shattering. Still dominated by the memcpy. But I have an older card and only PCIe 3.0.
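
For anyone curious what such a port involves, the CUDA calls map almost mechanically onto their HIP counterparts (an illustrative mapping, not SlyEcho's actual diff):

```c
// CUDA            ->  HIP / hipBLAS
// cudaMalloc      ->  hipMalloc
// cudaMemcpyAsync ->  hipMemcpyAsync
// cudaStream_t    ->  hipStream_t
// cublasSgemm     ->  hipblasSgemm
// kernel<<<grid, block, 0, stream>>>(...) stays as-is under hipcc,
// or can be written as hipLaunchKernelGGL(kernel, grid, block, 0, stream, ...)
```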


SlyEcho, Apr 20 '23

Hello there. I'm trying to build it using the make LLAMA_CUBLAS=1 command on Windows with WIN64DevKit. However, even though I have the CUDA Toolkit installed and changed the paths for -L and -I in the Makefile accordingly, it still can't find the following libraries: -lcublas_static, -lculibos, -lcublasLt_static, and -ldl.

Where can I get them? I would appreciate some help getting this to work. Thank you!

Dampfinchen, Apr 20 '23

Still dominated by the memcpy

If the weights are stored in the device HBM/DRAM, I suspect we can get much better perf than copying the weights each time.
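
One hedged sketch of that suggestion (not existing llama.cpp code): upload each quantized weight tensor to the device once, cache the pointer, and reuse it across evaluations so only the activations have to cross the bus.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// Cache device-resident copies of weight tensors, keyed by the host pointer,
// so the quantized weights are copied over PCIe only on first use.
typedef struct {
    const void * host;   // host weight data (key)
    void       * device; // persistent device copy
    size_t       size;
} weight_cache_entry;

#define MAX_WEIGHTS 1024
static weight_cache_entry g_weights[MAX_WEIGHTS];
static int g_n_weights = 0;

static void * get_device_weights(const void * host, size_t size) {
    for (int i = 0; i < g_n_weights; ++i) {
        if (g_weights[i].host == host) {
            return g_weights[i].device; // already resident, no copy needed
        }
    }
    if (g_n_weights == MAX_WEIGHTS) return NULL; // cache full (sketch only)

    void * dev = NULL;
    cudaMalloc(&dev, size);
    cudaMemcpy(dev, host, size, cudaMemcpyHostToDevice); // one-time upload

    g_weights[g_n_weights].host   = host;
    g_weights[g_n_weights].device = dev;
    g_weights[g_n_weights].size   = size;
    g_n_weights++;
    return dev;
}
```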

jon-chuang, Apr 26 '23