dfdx
CUDA Graphs
We should consider whether it is possible and desirable to automatically combine kernels into CUDA graphs to reduce the overhead of launching individual kernels.
Here is the relevant documentation:
- https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
This issue is probably not relevant to the GPU MVP, but should be tackled once optimizations become a concern. I opened this issue now because support for graphs might influence how the GPU support code is structured.
- Maybe gpu kernel operations should be lazy to allow combining kernels?
- Should there be an API for making graphs manually?
- Do we combine operations into graphs at compile time?
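To make the first question concrete, here is a minimal sketch of what lazy kernel operations could look like: ops are recorded into a graph instead of being launched eagerly, and the whole graph is launched at once. All names here (`LazyGraph`, `Op`) are hypothetical and not part of dfdx; a real backend would record kernel parameters and graph nodes rather than host closures.

```rust
// Hypothetical sketch: lazy ops are recorded into a graph instead of
// being launched eagerly. A real implementation would add nodes to a
// cudaGraph_t; here a host closure stands in for a kernel launch.
type Op = Box<dyn Fn(&mut Vec<f32>)>;

#[derive(Default)]
struct LazyGraph {
    ops: Vec<Op>,
}

impl LazyGraph {
    /// Record an op without executing it (analogous to adding a graph node).
    fn record(&mut self, op: Op) {
        self.ops.push(op);
    }

    /// Replay all recorded ops in order (analogous to instantiating and
    /// launching the captured graph).
    fn launch(&self, data: &mut Vec<f32>) {
        for op in &self.ops {
            op(data);
        }
    }
}

fn main() {
    let mut graph = LazyGraph::default();
    // Recording does no work yet.
    graph.record(Box::new(|d| d.iter_mut().for_each(|x| *x += 1.0)));
    graph.record(Box::new(|d| d.iter_mut().for_each(|x| *x *= 2.0)));

    let mut data = vec![1.0, 2.0];
    graph.launch(&mut data);
    println!("{:?}", data); // [4.0, 6.0]
}
```

A nice property of this shape is that the same recorded graph can be relaunched every training step, which is exactly where CUDA graphs pay off.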
Other resources:
- How PyTorch uses CUDA Graphs: https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/
- CUDA Graph Runtime API: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH
Relevant labels for this issue would be `gpu` and `optimization`.
I think we could do this at the device level - `CudaGraph` would be similar to `Cuda`, but instead of launching kernels it would add nodes to a graph (if they are not there already). I could envision the forward/backward/optimizer passes all being part of the graph. Perhaps execution would actually happen when a dev -> host transfer is requested?