[CUDA] Buffer cache and CUDA graph
To reduce the overhead of CUDA graphs to a minimum, we should avoid updating the graph as much as possible; in the ideal situation the graph is created once and only replayed afterwards. This is very hard to achieve in MLX because the memory addresses of arrays almost always change between iterations.
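For reference, the capture-once / replay-many flow with the CUDA runtime API looks roughly like the sketch below (using the CUDA 12 signature of `cudaGraphInstantiate`). The device pointers the captured kernels see are baked into the graph, which is exactly why changing array addresses breaks pure replay:

```cpp
// Sketch of capture-once / replay-many with the CUDA runtime API.
// The pointer d_x is frozen into the graph at capture time, so it must stay
// valid (and hold the right data) for every subsequent replay.
#include <cuda_runtime.h>

__global__ void scale(float* x, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
}

void capture_and_replay(float* d_x, int n, int iters) {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  cudaGraph_t graph;
  cudaGraphExec_t exec;

  // Capture one iteration of work into a graph.
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  scale<<<(n + 255) / 256, 256, 0, stream>>>(d_x, 0.5f, n);
  cudaStreamEndCapture(stream, &graph);
  cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12 signature

  // Replay: near-zero CPU overhead, but only correct if d_x is unchanged.
  for (int i = 0; i < iters; ++i) {
    cudaGraphLaunch(exec, stream);
  }
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
  cudaStreamDestroy(stream);
}
```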
However, I think it is still achievable in MLX. First let's look at how llama.cpp integrates CUDA graphs: https://github.com/ggml-org/llama.cpp/pull/6766. The code is actually quite simple: it captures the graph once and then always replays it. There are some cases that invalidate the graph, but it never has to worry about updating memory addresses. llama.cpp can do this because it uses a very simple memory allocation strategy: it allocates a large chunk of memory in advance (whose size is estimated by the user), and all arrays created in the compute graph are allocated from that preallocated memory, so memory addresses do not change between iterations.
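As an illustration (not llama.cpp's actual code), the effect of that strategy can be modeled as a bump allocator over one preallocated device buffer: as long as each iteration requests the same sizes in the same order, every array lands at the same address and the captured graph needs no pointer updates.

```cpp
// Illustrative bump allocator over a single preallocated device buffer.
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

class ArenaAllocator {
 public:
  explicit ArenaAllocator(size_t capacity) : capacity_(capacity) {
    cudaMalloc(&base_, capacity_);  // one big device allocation up front
  }
  ~ArenaAllocator() { cudaFree(base_); }

  // Called at the start of every iteration: nothing is ever cudaFree'd,
  // the write cursor is simply rewound.
  void reset() { offset_ = 0; }

  // Hand out the next aligned slice of the arena. Identical request
  // sequences yield identical addresses every iteration.
  void* alloc(size_t size, size_t align = 256) {
    size_t aligned = (offset_ + align - 1) / align * align;
    if (aligned + size > capacity_) return nullptr;  // arena was sized too small
    offset_ = aligned + size;
    return static_cast<uint8_t*>(base_) + aligned;
  }

 private:
  void* base_ = nullptr;
  size_t capacity_ = 0;
  size_t offset_ = 0;
};
```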
MLX uses traditional memory management: arrays allocate memory when created and free it when destroyed. With the buffer cache, however, allocations and frees usually only happen in the first few iterations; after that, every array just exchanges buffers with the cache without doing real allocations.
What if the buffer cache were designed so that, for the same graph, arrays are guaranteed to get the same memory addresses when created?
If we can do that, a few things would be possible:
- We can make sure the memory addresses passed to kernels stay unchanged across iterations.
- There is no longer any need to free buffers after the first iteration, i.e. no more insertion of `event_signal` kernels.
With that, we can achieve perfect graph replay.
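Here is a minimal sketch of what such a cache could look like (hypothetical, not MLX's actual allocator): remember which buffer was handed out for the k-th request of the first iteration, and return the same buffer for the k-th request of every later iteration, without freeing anything in between.

```cpp
// Hypothetical deterministic buffer cache: assumes the request sequence
// (order and sizes) is identical in every iteration of the same graph.
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

class DeterministicBufferCache {
 public:
  void begin_iteration() { cursor_ = 0; }

  // Returns the same pointer for the k-th request of every iteration.
  void* request(size_t size) {
    if (cursor_ < slots_.size()) {
      // Replay path: reuse the buffer recorded for this position.
      return slots_[cursor_++].ptr;
    }
    // First iteration: do the real allocation and remember it.
    void* ptr = nullptr;
    cudaMalloc(&ptr, size);
    slots_.push_back({ptr, size});
    ++cursor_;
    return ptr;
  }

  // "Free" is a no-op after recording; buffers live as long as the cache,
  // so no free-related kernels need to be inserted into the graph.
  void release(void* /*ptr*/) {}

  ~DeterministicBufferCache() {
    for (auto& s : slots_) cudaFree(s.ptr);
  }

 private:
  struct Slot { void* ptr; size_t size; };
  std::vector<Slot> slots_;
  size_t cursor_ = 0;
};
```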
Some reference links:
PyTorch's notes on CUDA Caching Allocator interaction with graph capture:
> Graph capture performs a dry run of a region of execution, freezing all CUDA work (and virtual addresses used during that work) into a "graph." The graph may be "replayed" like a single giant kernel, with greatly reduced CPU overhead as well as modestly improved GPU performance.
>
> Because capture bakes in memory addresses, the memory used during capture must be available for the graph to use during replay. DeviceCachingAllocator assigns and frees memory eagerly and dynamically, so if we're not careful about managing graphs' memory, at replay time those memory addresses could be used by other tensors.
>
> To guarantee a graph's baked in addresses are safe to reuse in replay, DeviceAllocator satisfies allocations from a graph-private memory pool during capture, and doesn't begin cudaFreeing those addresses until the graph is destroyed.
>
> Within the private pool, allocations are freed and reassigned as usual during capture. Memory regions will be used in a consistent order during replay. So a private pool doesn't use memory more wastefully than the default pools during capture, but it does reserve its high-water mark of used memory away from the default pools as long as the capture(s) it served survive (regardless whether those captures are idle or replaying).
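To make the private-pool idea concrete, a rough sketch (not PyTorch's implementation) could look like this: frees during capture only return blocks to the pool's own free list for reuse, and the real `cudaFree` calls are deferred until the pool, i.e. the graph that owns it, is destroyed.

```cpp
// Rough sketch of a graph-private pool in the spirit of the note above.
#include <cuda_runtime.h>
#include <cstddef>
#include <map>
#include <vector>

class GraphPrivatePool {
 public:
  void* alloc(size_t size) {
    // Reuse a block freed earlier during capture if one is large enough.
    auto it = free_blocks_.lower_bound(size);
    if (it != free_blocks_.end()) {
      void* ptr = it->second;
      free_blocks_.erase(it);
      return ptr;
    }
    void* ptr = nullptr;
    cudaMalloc(&ptr, size);
    owned_.push_back({ptr, size});
    return ptr;
  }

  // "Free" during capture only returns the block to the pool's free list,
  // so the address stays reserved for the graph.
  void free(void* ptr, size_t size) { free_blocks_.emplace(size, ptr); }

  // Real cudaFree happens only when the graph that used this pool is gone.
  ~GraphPrivatePool() {
    for (auto& b : owned_) cudaFree(b.ptr);
  }

 private:
  struct Block { void* ptr; size_t size; };
  std::vector<Block> owned_;
  std::multimap<size_t, void*> free_blocks_;
};
```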
CUDAGraph Trees: an advanced approach to handling dynamic inputs/outputs and sharing a memory pool between multiple graphs.
- Design doc
- PR: https://github.com/pytorch/pytorch/pull/89146
There are also efforts in PyTorch to update the memory addresses in a captured graph instead of forcing all tensors to be allocated at static addresses:
- https://github.com/pytorch/pytorch/pull/137318
- https://github.com/pytorch/pytorch/pull/152622
- https://github.com/pytorch/pytorch/pull/160351