dfdx
CUDA Graphs
We should consider whether it is possible and desirable to automatically combine kernels into CUDA graphs to reduce the overhead of launching individual kernels.
Here is the relevant documentation:
- https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
This issue is probably not relevant to the GPU MVP, but should be tackled once optimizations become a concern. I opened this issue now because support for graphs might influence how the GPU support code is structured.
- Maybe gpu kernel operations should be lazy to allow combining kernels?
- Should there be an API for making graphs manually?
- Do we combine operations into graphs at compile time?
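To make the first question concrete, here is a minimal sketch of what lazy kernel operations could look like: ops are recorded into a graph instead of being launched eagerly, and the whole graph is launched at once. All names here (`LazyGraph`, `Op`) are hypothetical and not part of dfdx; a real backend would record kernel parameters and graph nodes rather than host closures.

```rust
// Hypothetical sketch: lazy ops are recorded into a graph instead of
// being launched eagerly. A real implementation would add nodes to a
// cudaGraph_t; here a host closure stands in for a kernel launch.
type Op = Box<dyn Fn(&mut Vec<f32>)>;

#[derive(Default)]
struct LazyGraph {
    ops: Vec<Op>,
}

impl LazyGraph {
    /// Record an op without executing it (analogous to adding a graph node).
    fn record(&mut self, op: Op) {
        self.ops.push(op);
    }

    /// Replay all recorded ops in order (analogous to instantiating and
    /// launching the captured graph).
    fn launch(&self, data: &mut Vec<f32>) {
        for op in &self.ops {
            op(data);
        }
    }
}

fn main() {
    let mut graph = LazyGraph::default();
    // Recording does no work yet.
    graph.record(Box::new(|d| d.iter_mut().for_each(|x| *x += 1.0)));
    graph.record(Box::new(|d| d.iter_mut().for_each(|x| *x *= 2.0)));

    let mut data = vec![1.0, 2.0];
    graph.launch(&mut data);
    println!("{:?}", data); // [4.0, 6.0]
}
```

A nice property of this shape is that the same recorded graph can be relaunched every training step, which is exactly where CUDA graphs pay off.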
Other resources:
- How PyTorch uses CUDA Graphs: https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/
- CUDA Graph Runtime API: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH
Relevant labels for this issue would be `gpu` and `optimization`.
I think we could do this at the device level - `CudaGraph` would be similar to `Cuda`, but instead of launching kernels it would add nodes to a graph (if they are not there already). I could envision the forward/backward/optimizer passes all being part of the graph. Perhaps execution would actually happen when a dev -> host transfer is requested?