cuBLAS: use host pinned memory and dequantize while copying
Copying memory to the GPU from pageable memory is slow because it forces CUDA to copy the buffer to non-pageable memory before it can DMA it to the GPU. This also means that cudaMemcpyAsync is actually synchronous.
By storing the ggml context in non-pageable, pinned memory, this additional copy is avoided, and cudaMemcpyAsync is done asynchronously. This also makes it possible to dequantize while copying data for the other matrix.
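As a rough illustration of why this matters (a minimal sketch, not the PR's actual code; buffer names and sizes are made up), pinned host memory is allocated with cudaMallocHost instead of malloc, which lets cudaMemcpyAsync return immediately so other work can overlap with the transfer:

```c
// Sketch only: shows pageable vs. pinned host buffers with cudaMemcpyAsync.
#include <cuda_runtime.h>
#include <stdlib.h>

void copy_example(size_t n) {
    float *h_pageable = (float *) malloc(n * sizeof(float));
    float *h_pinned;
    cudaMallocHost((void **) &h_pinned, n * sizeof(float)); // page-locked (pinned) host memory

    float *d_buf;
    cudaMalloc((void **) &d_buf, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // From pageable memory the driver first stages the data through an internal
    // pinned buffer, so this "async" call effectively blocks on the host.
    cudaMemcpyAsync(d_buf, h_pageable, n * sizeof(float), cudaMemcpyHostToDevice, stream);

    // From pinned memory the DMA transfer is queued and the call returns at once,
    // so e.g. a dequantize kernel for the other matrix can be launched meanwhile.
    cudaMemcpyAsync(d_buf, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    free(h_pageable);
}
```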
To observe most of the benefits, this has to be used with --no-mmap, otherwise the weights will be stored in pageable, memory-mapped memory. With mmap enabled, there is still some benefit from the non-weight matrices. In the future, this will be solved by caching the weights in GPU memory, avoiding the copy entirely.
To avoid adding a CUDA-only function to the ggml interface, llama.cpp has been modified to include ggml-cuda.h when cuBLAS is enabled.
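In practice this amounts to a guarded include along these lines (a sketch; it assumes the cuBLAS build defines GGML_USE_CUBLAS, and the exact macro name may differ in the tree):

```c
// Sketch of the conditional include in llama.cpp: ggml-cuda.h is only pulled in
// when building with cuBLAS, so the ggml interface itself stays CUDA-free.
#ifdef GGML_USE_CUBLAS
#include "ggml-cuda.h"
#endif
```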
For me, this represents a ~30% speedup in perplexity times with cuBLAS.
PR:

Master:

I think these changes look great. You said elsewhere that this stuff might cause "some friction", but I think it turns out to be very non-intrusive. The CUDA stuff is still relatively self-contained and is separated from the ggml core.
Of course, @ggerganov might have a different opinion, but I think this should be merged as is.
Unrelated stuff
On AMD I'm noticing something funny: it creates 64 additional GPU threads. If I use --memory_f32, it doesn't.

EDIT: 5.07 seconds per pass - ETA 55 minutes. Let's see in an hour or so.
@SlyEcho are you sure that this is with this branch and not cuda-f16f32? That one does create 64 additional streams.
@slaren, you are quite right, this is slaren/cuda-f16f32
But it does have the same changes included?
Anyway, perplexity on Q4_0 was [655]6.2838
> But it does have the same changes included?
Yes, that branch is built on top of this one, with additional changes to the f16 x f32 mat mul.