llama.cpp
cuBLAS memory management update
I changed the memory management. The current variant only supports 16 allocated free buffers, and it hands out the first free one even if a better-sized buffer is available.
The new method comes with a few changes:
- it ups the maximum number of buffers to 512 (previously this was kept low because lookups went through a linear loop)
- introduces a size check when reusing buffers (this was not enforced previously)
- maintains a size-ordered list of the allocated, available memory buffers
- uses binary search to insert and to take memory blocks (sketched below)

It currently hardcodes 2 GB of VRAM to match all cards; when the VRAM limit or the buffer count is exceeded, it writes a warning to stderr, like the previous variant.
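A minimal sketch of what such a size-ordered pool with binary-search insert/take could look like (the struct, function names, and error handling here are illustrative only, not the actual patch):

```cpp
// Hypothetical sketch of a size-ordered free-buffer pool with binary search.
#include <cuda_runtime.h>
#include <cstdio>

#define POOL_MAX_BUFFERS 512   // the raised buffer limit described above

struct cuda_free_buffer {
    void * ptr;
    size_t size;
};

// Free buffers kept sorted by size, smallest first.
static cuda_free_buffer g_pool[POOL_MAX_BUFFERS];
static int g_pool_count = 0;

// Binary search for the smallest free buffer whose size is >= `size`.
// Returns g_pool_count if no buffer is large enough.
static int pool_lower_bound(size_t size) {
    int lo = 0, hi = g_pool_count;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (g_pool[mid].size < size) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}

// Take the best-fitting free buffer, or cudaMalloc a new one.
// The actual buffer size is reported so the caller can pass it back to pool_free.
static void * pool_malloc(size_t size, size_t * actual_size) {
    int i = pool_lower_bound(size);
    if (i < g_pool_count) {
        void * ptr   = g_pool[i].ptr;
        *actual_size = g_pool[i].size;
        for (int j = i; j + 1 < g_pool_count; ++j) {  // remove entry i, keep the array sorted
            g_pool[j] = g_pool[j + 1];
        }
        g_pool_count--;
        return ptr;
    }
    void * ptr = nullptr;
    if (cudaMalloc(&ptr, size) != cudaSuccess) {
        fprintf(stderr, "cuda pool: cudaMalloc of %zu bytes failed\n", size);
        return nullptr;
    }
    *actual_size = size;
    return ptr;
}

// Return a buffer to the pool, inserting it at its sorted position.
static void pool_free(void * ptr, size_t size) {
    if (g_pool_count >= POOL_MAX_BUFFERS) {
        fprintf(stderr, "cuda pool: buffer count exceeded, freeing directly\n");
        cudaFree(ptr);
        return;
    }
    int i = pool_lower_bound(size);
    for (int j = g_pool_count; j > i; --j) {  // shift larger entries up by one
        g_pool[j] = g_pool[j - 1];
    }
    g_pool[i].ptr  = ptr;
    g_pool[i].size = size;
    g_pool_count++;
}
```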
I tested it on a generic example with thousands of buffers, as well as with llama, and it appears to work fine. It's quite a bit of fresh code, so no guarantees.
This is my plan to improve the cuBLAS performance and memory usage:
Knowing that we only perform one mat mul at a time, we don't need more than 4 buffers, one for each of the `d_X`, `d_Y`, `d_D` and `d_Q` matrices. If we are able to predict the maximum sizes correctly, we would never need to allocate more than these four buffers, and this can be done fairly easily by adding some checks to `ggml_graph_compute`. Additionally, to avoid having to re-upload the model weights constantly and use the available VRAM more efficiently, the `d_Q` buffer could be replaced by a cache. That's what I have been working towards in my `cuda-cache` branch.
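As a rough illustration of the fixed-buffer idea (all names and the reservation logic below are hypothetical, not the actual code in the `cuda-cache` branch): each of the four buffers is grown once to the maximum size predicted ahead of time, e.g. by a pre-pass over the graph, and is simply reused for every mat mul afterwards.

```cpp
// Hypothetical sketch: one persistent device buffer per matrix (d_X, d_Y, d_D, d_Q),
// sized once to the predicted maximum so no per-mat-mul allocation is needed.
#include <cuda_runtime.h>
#include <stddef.h>

enum cuda_buffer_id { CUDA_BUF_X, CUDA_BUF_Y, CUDA_BUF_D, CUDA_BUF_Q, CUDA_BUF_COUNT };

static void * g_cuda_buf[CUDA_BUF_COUNT]      = { NULL };
static size_t g_cuda_buf_size[CUDA_BUF_COUNT] = { 0 };

// Called once per graph, with the maximum size each buffer will ever need
// (the sizes would come from a pre-pass in ggml_graph_compute).
static void cuda_buffer_reserve(cuda_buffer_id id, size_t max_size) {
    if (max_size <= g_cuda_buf_size[id]) {
        return; // already large enough
    }
    if (g_cuda_buf[id] != NULL) {
        cudaFree(g_cuda_buf[id]);
    }
    cudaMalloc(&g_cuda_buf[id], max_size);
    g_cuda_buf_size[id] = max_size;
}

// During the actual mat muls the buffers are simply reused, never reallocated.
static void * cuda_buffer_get(cuda_buffer_id id) {
    return g_cuda_buf[id];
}
```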
So for what I am doing, there is no need at all to have more buffers; on the contrary, the memory pool code could be simplified further once we know that we will never need to reallocate these buffers. Do you have any specific plans that would benefit from these changes?
We could also investigate some more advanced memory management like `cudaMallocManaged`: with a single pointer valid in both device and host memory space, we wouldn't need to copy the data manually from one place to the other, just use the prefetch command to make it available on the device or host as required.
Performance-wise, I don't know if it would make a difference.
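For reference, a minimal sketch of what the managed-memory approach could look like (illustrative only; whether it actually performs well is exactly the open question above):

```cpp
// Unified-memory sketch: one pointer usable on both host and device, with explicit
// prefetches instead of manual cudaMemcpy calls.
#include <cuda_runtime.h>

int main(void) {
    const size_t n_bytes = 1024 * 1024;

    float * data = NULL;
    cudaMallocManaged(&data, n_bytes);      // single allocation visible to host and device

    // Fill on the host through the same pointer.
    for (size_t i = 0; i < n_bytes / sizeof(float); ++i) {
        data[i] = (float) i;
    }

    int device = 0;
    cudaGetDevice(&device);

    // Hint the driver to migrate the pages to the GPU before the GEMM call...
    cudaMemPrefetchAsync(data, n_bytes, device, 0);
    // ... run the mat mul here ...

    // ...and back to the host before reading the result on the CPU.
    cudaMemPrefetchAsync(data, n_bytes, cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```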
I think a prefetch/cache is certainly the way to go; there is a ton of room for improvement over the current implementation.
Regarding my variant: it was not tailored to llama, I'm just posting it here since this is the only ggml project that actually uses cuBLAS at the moment. If you use the same 4 buffers each time it won't be necessary, but sooner or later another idea will come up and more buffers will be used, or another project will use cuBLAS and need more than a few buffers.
The current implementation would hand a 4 GB allocation to a 1 MB malloc request if it happens to be the first free buffer available.
So it's more a ggml improvement than a llama improvement.
@SlyEcho I gave that a try, but `cudaMemPrefetchAsync` doesn't work on my machine, and without it the performance is very bad, so at least for me this doesn't seem to be a workable solution.
@slaren that sucks.
What about `cudaHostAlloc()` and `cudaHostGetDevicePointer()`?
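For context, a rough sketch of the zero-copy path those two calls enable (illustrative only; the device then reads the pinned host memory over the bus on every access, so performance would still need to be measured):

```cpp
// Zero-copy sketch: pinned, mapped host memory that the GPU can access directly,
// so no managed memory or explicit prefetch is involved.
#include <cuda_runtime.h>

int main(void) {
    // Must be set before the CUDA context is created to allow mapped host memory.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const size_t n_bytes = 1024 * 1024;

    // Allocate pinned host memory that is mapped into the device address space.
    float * h_ptr = NULL;
    cudaHostAlloc((void **) &h_ptr, n_bytes, cudaHostAllocMapped);

    // Get the device-side alias for the same memory; on UVA systems this is
    // typically the same address as h_ptr.
    float * d_ptr = NULL;
    cudaHostGetDevicePointer((void **) &d_ptr, h_ptr, 0);

    // d_ptr can now be passed to a kernel or cuBLAS call; writes by the host
    // through h_ptr are visible to the device without an explicit copy.

    cudaFreeHost(h_ptr);
    return 0;
}
```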