llama.cpp
cuBLAS memory management update
I changed the memory management. The current variant only supports 16 allocated free buffers, and it hands out the first free one even if a better-sized buffer is available.
The new method comes with a few changes:
- it ups the maximum number of buffers to 512 (previously this was kept low because lookups went through a linear loop)
- introduces a size check when reusing buffers (this was not enforced previously)
- maintains a size-ordered list of the allocated, available memory buffers
- uses binary search to insert and to take memory blocks (sketched below)

It currently hardcodes 2 GB of VRAM to match all cards; when the VRAM limit or the buffer count is exceeded, it writes a warning to stderr, like the previous variant.
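A minimal sketch of what such a size-ordered pool with binary-search insert/take could look like (the struct, function names, and error handling here are illustrative only, not the actual patch):

```cpp
// Hypothetical sketch of a size-ordered free-buffer pool with binary search.
#include <cuda_runtime.h>
#include <cstdio>

#define POOL_MAX_BUFFERS 512   // the raised buffer limit described above

struct cuda_free_buffer {
    void * ptr;
    size_t size;
};

// Free buffers kept sorted by size, smallest first.
static cuda_free_buffer g_pool[POOL_MAX_BUFFERS];
static int g_pool_count = 0;

// Binary search for the smallest free buffer whose size is >= `size`.
// Returns g_pool_count if no buffer is large enough.
static int pool_lower_bound(size_t size) {
    int lo = 0, hi = g_pool_count;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (g_pool[mid].size < size) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}

// Take the best-fitting free buffer, or cudaMalloc a new one.
// The actual buffer size is reported so the caller can pass it back to pool_free.
static void * pool_malloc(size_t size, size_t * actual_size) {
    int i = pool_lower_bound(size);
    if (i < g_pool_count) {
        void * ptr   = g_pool[i].ptr;
        *actual_size = g_pool[i].size;
        for (int j = i; j + 1 < g_pool_count; ++j) {  // remove entry i, keep the array sorted
            g_pool[j] = g_pool[j + 1];
        }
        g_pool_count--;
        return ptr;
    }
    void * ptr = nullptr;
    if (cudaMalloc(&ptr, size) != cudaSuccess) {
        fprintf(stderr, "cuda pool: cudaMalloc of %zu bytes failed\n", size);
        return nullptr;
    }
    *actual_size = size;
    return ptr;
}

// Return a buffer to the pool, inserting it at its sorted position.
static void pool_free(void * ptr, size_t size) {
    if (g_pool_count >= POOL_MAX_BUFFERS) {
        fprintf(stderr, "cuda pool: buffer count exceeded, freeing directly\n");
        cudaFree(ptr);
        return;
    }
    int i = pool_lower_bound(size);
    for (int j = g_pool_count; j > i; --j) {  // shift larger entries up by one
        g_pool[j] = g_pool[j - 1];
    }
    g_pool[i].ptr  = ptr;
    g_pool[i].size = size;
    g_pool_count++;
}
```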
I tested it on a generic example with thousands of buffers, as well as with llama, and it appears to work fine. It's quite a bit of fresh code, so no guarantees.
This is my plan to improve the cuBLAS performance and memory usage:
Knowing that we only perform one mat mul at a time, we don't need more than 4 buffers, one for each of the `d_X`, `d_Y`, `d_D` and `d_Q` matrices. If we are able to predict the maximum sizes correctly, we would never need to allocate more than these four buffers, and this can be done fairly easily by adding some checks to `ggml_graph_compute`. Additionally, to avoid having to re-upload the model weights constantly and use the available VRAM more efficiently, the `d_Q` buffer could be replaced by a cache. That's what I have been working towards in my `cuda-cache` branch.
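As a rough illustration of the fixed-buffer idea (all names and the reservation logic below are hypothetical, not the actual code in the `cuda-cache` branch): each of the four buffers is grown once to the maximum size predicted ahead of time, e.g. by a pre-pass over the graph, and is simply reused for every mat mul afterwards.

```cpp
// Hypothetical sketch: one persistent device buffer per matrix (d_X, d_Y, d_D, d_Q),
// sized once to the predicted maximum so no per-mat-mul allocation is needed.
#include <cuda_runtime.h>
#include <stddef.h>

enum cuda_buffer_id { CUDA_BUF_X, CUDA_BUF_Y, CUDA_BUF_D, CUDA_BUF_Q, CUDA_BUF_COUNT };

static void * g_cuda_buf[CUDA_BUF_COUNT]      = { NULL };
static size_t g_cuda_buf_size[CUDA_BUF_COUNT] = { 0 };

// Called once per graph, with the maximum size each buffer will ever need
// (the sizes would come from a pre-pass in ggml_graph_compute).
static void cuda_buffer_reserve(cuda_buffer_id id, size_t max_size) {
    if (max_size <= g_cuda_buf_size[id]) {
        return; // already large enough
    }
    if (g_cuda_buf[id] != NULL) {
        cudaFree(g_cuda_buf[id]);
    }
    cudaMalloc(&g_cuda_buf[id], max_size);
    g_cuda_buf_size[id] = max_size;
}

// During the actual mat muls the buffers are simply reused, never reallocated.
static void * cuda_buffer_get(cuda_buffer_id id) {
    return g_cuda_buf[id];
}
```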
So for what I am doing, there is no need at all to have more buffers; on the contrary, the memory pool code could be simplified further once we know that we will never need to reallocate these buffers. Do you have any specific plans that would benefit from these changes?
We could also investigate some more advanced memory management like `cudaMallocManaged`: with a single pointer valid in both device and host memory space, we wouldn't need to copy the data manually from one place to the other, just use the prefetch command to make it available on the device or host as required.
Performance-wise, I don't know if it would make a difference.
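For reference, a minimal sketch of what the managed-memory approach could look like (illustrative only; whether it actually performs well is exactly the open question above):

```cpp
// Unified-memory sketch: one pointer usable on both host and device, with explicit
// prefetches instead of manual cudaMemcpy calls.
#include <cuda_runtime.h>

int main(void) {
    const size_t n_bytes = 1024 * 1024;

    float * data = NULL;
    cudaMallocManaged(&data, n_bytes);      // single allocation visible to host and device

    // Fill on the host through the same pointer.
    for (size_t i = 0; i < n_bytes / sizeof(float); ++i) {
        data[i] = (float) i;
    }

    int device = 0;
    cudaGetDevice(&device);

    // Hint the driver to migrate the pages to the GPU before the GEMM call...
    cudaMemPrefetchAsync(data, n_bytes, device, 0);
    // ... run the mat mul here ...

    // ...and back to the host before reading the result on the CPU.
    cudaMemPrefetchAsync(data, n_bytes, cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```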
I think a prefetch/cache is certainly the way to go; there is a ton of room for improvement over the current implementation.
Regarding my variant: it was not tailored to llama, I'm just posting it here since this is the only ggml project that actually uses cuBLAS at the moment. If you use the same 4 buffers each time it won't be necessary, but sooner or later another idea will come up and more buffers will be used, or another project will use cuBLAS and need more than a few buffers.
The current implementation would hand a 4 GB allocation to a 1 MB malloc request if it happens to be the first free buffer available.
So it's more a ggml improvement than a llama improvement.
@SlyEcho I gave that a try, but `cudaMemPrefetchAsync` doesn't work on my machine, and without it the performance is very bad, so at least for me this doesn't seem to be a workable solution.
@slaren that sucks.
What about `cudaHostAlloc()` and `cudaHostGetDevicePointer()`?
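For context, a rough sketch of the zero-copy path those two calls enable (illustrative only; the device then reads the pinned host memory over the bus on every access, so performance would still need to be measured):

```cpp
// Zero-copy sketch: pinned, mapped host memory that the GPU can access directly,
// so no managed memory or explicit prefetch is involved.
#include <cuda_runtime.h>

int main(void) {
    // Must be set before the CUDA context is created to allow mapped host memory.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const size_t n_bytes = 1024 * 1024;

    // Allocate pinned host memory that is mapped into the device address space.
    float * h_ptr = NULL;
    cudaHostAlloc((void **) &h_ptr, n_bytes, cudaHostAllocMapped);

    // Get the device-side alias for the same memory; on UVA systems this is
    // typically the same address as h_ptr.
    float * d_ptr = NULL;
    cudaHostGetDevicePointer((void **) &d_ptr, h_ptr, 0);

    // d_ptr can now be passed to a kernel or cuBLAS call; writes by the host
    // through h_ptr are visible to the device without an explicit copy.

    cudaFreeHost(h_ptr);
    return 0;
}
```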