
CUDA Graphs

Open ViliamVadocz opened this issue 2 years ago • 2 comments

We should consider whether it is possible and desirable to automatically combine kernels into CUDA graphs to reduce the overhead of launching individual kernels.

Here is the relevant documentation:

  • https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs

This issue is probably not relevant to the GPU MVP, but should be tackled once optimizations become a concern. I opened this issue now because support for graphs might influence how the GPU support code is structured.

  • Maybe GPU kernel operations should be lazy to allow combining kernels?
  • Should there be an API for making graphs manually?
  • Do we combine operations into graphs at compile time?

Other resources:

  • How PyTorch uses CUDA Graphs: https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/
  • CUDA Graph Runtime API: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH

ViliamVadocz avatar Jan 13 '23 14:01 ViliamVadocz

Relevant labels for this issue would be gpu and optimization.

ViliamVadocz avatar Feb 27 '23 17:02 ViliamVadocz

I think we could do this at the device level - CudaGraph would be similar to Cuda, but instead of launching kernels it would add nodes to a graph (if not there already). I could envision the forward/backward/optimizer passes all being part of the graph. Perhaps the graph is actually executed when a dev -> host transfer is requested?

coreylowman avatar Mar 31 '23 12:03 coreylowman