
cuBLAS: use host pinned memory and dequantize while copying

Open slaren opened this issue 2 years ago • 5 comments

Copying data to the GPU from pageable host memory is slow because it forces CUDA to copy the buffer into non-pageable memory before it can DMA it to the GPU. It also means that cudaMemcpyAsync is effectively synchronous.
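
For context, here is a minimal sketch (mine, not the PR's code) of the difference between the two kinds of host memory; the function and buffer names are illustrative:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Uploading from a pageable malloc'd buffer forces CUDA to stage the data
// through an internal pinned buffer, so cudaMemcpyAsync behaves synchronously
// with respect to the host. Uploading from cudaMallocHost (pinned) memory
// lets the DMA engine read the buffer directly and the call returns at once.
void upload_example(float *d_dst, size_t n, cudaStream_t stream) {
    float *pageable = (float *) malloc(n * sizeof(float));   // pageable host memory
    float *pinned   = NULL;
    cudaMallocHost((void **) &pinned, n * sizeof(float));    // page-locked (pinned) host memory

    // effectively synchronous: the host blocks while the data is staged
    cudaMemcpyAsync(d_dst, pageable, n * sizeof(float), cudaMemcpyHostToDevice, stream);

    // truly asynchronous: returns immediately, the copy overlaps with host work
    cudaMemcpyAsync(d_dst, pinned, n * sizeof(float), cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);
    cudaFreeHost(pinned);
    free(pageable);
}
```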

By storing the ggml context in non-pageable, pinned memory, this additional copy is avoided and cudaMemcpyAsync actually runs asynchronously. It also makes it possible to dequantize one matrix while copying the data for the other.
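
Roughly, the overlap looks like this (a sketch under my own naming; the toy kernel stands in for the real Q4_0 dequantization and is not ggml-cuda's actual implementation):

```cuda
#include <cuda_runtime.h>

// Placeholder for the real dequantization kernel: expands quantized weights
// into f32 on the device.
__global__ void dequantize_toy(const unsigned char *q, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (float) q[i];   // stand-in for Q4_0 -> f32
}

// While the quantized weights of one matrix are dequantized on one stream,
// the pinned host data of the other matrix is uploaded on a second stream.
void overlap_example(const unsigned char *d_w_quant, float *d_w_f32,
                     const float *h_src_pinned, float *d_src, int n) {
    cudaStream_t s_copy, s_compute;
    cudaStreamCreate(&s_copy);
    cudaStreamCreate(&s_compute);

    // asynchronous upload from pinned host memory
    cudaMemcpyAsync(d_src, h_src_pinned, n * sizeof(float),
                    cudaMemcpyHostToDevice, s_copy);

    // concurrent dequantization of weights already resident on the device
    dequantize_toy<<<(n + 255) / 256, 256, 0, s_compute>>>(d_w_quant, d_w_f32, n);

    // the cuBLAS GEMM that consumes both matrices waits for the two streams
    cudaStreamSynchronize(s_copy);
    cudaStreamSynchronize(s_compute);

    cudaStreamDestroy(s_copy);
    cudaStreamDestroy(s_compute);
}
```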

To observe most of the benefits, this has to be used with --no-mmap; otherwise the weights are stored in pageable, memory-mapped memory. With mmap enabled there is still some benefit from the non-weight matrices. In the future, this will be solved by caching the weights in GPU memory, avoiding the copy entirely.

To avoid adding a CUDA-only function to the ggml interface, llama.cpp has been modified to include ggml-cuda.h when cuBLAS is enabled.
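
In practice that just means guarding the include behind the cuBLAS build flag; a sketch of the shape of the change, not the exact diff:

```cuda
// In llama.cpp, only pulled in when building with cuBLAS support:
#ifdef GGML_USE_CUBLAS
#include "ggml-cuda.h"   // the CUDA-only pinned-memory helpers live here, not in ggml.h
#endif
```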

For me, this represents a ~30% speedup in perplexity times with cuBLAS.

PR: (screenshot)

Master: (screenshot)

slaren avatar Apr 27 '23 20:04 slaren

I think these changes look great. You said elsewhere that this stuff might cause "some friction", but I think it turns out to be very non-intrusive. The CUDA stuff is still relatively self-contained and separated from the ggml core.

Of course, @ggerganov might have a different opinion, but I think this should be merged as is.

dfyz avatar Apr 28 '23 00:04 dfyz

Unrelated stuff

On AMD I'm noticing something funny: it creates 64 additional GPU threads. If I use --memory_f32, it doesn't.

Otherwise it works too; I will add the additional definitions to my port so it can be merged.

EDIT: 5.07 seconds per pass, ETA 55 minutes. Let's see in an hour or so.

SlyEcho avatar Apr 28 '23 07:04 SlyEcho

@SlyEcho are you sure that this is with this branch and not cuda-f16f32? That one does create 64 additional streams.

slaren avatar Apr 28 '23 08:04 slaren

@slaren, you are quite right, this is slaren/cuda-f16f32.

But it does have the same changes included?

Anyway, perplexity on Q4_0 was [655]6.2838

SlyEcho avatar Apr 28 '23 08:04 SlyEcho

> But it does have the same changes included?

Yes, that branch is built on top of this one, with additional changes to the f16 x f32 mat mul.

slaren avatar Apr 28 '23 08:04 slaren