cudnn compile-time improvement
- Compilation time is down from 1m to 31s (on my local machine). This is achieved by renaming cudnn_attn.cu to cudnn_attn.cpp. When nvcc sees the .cpp extension, it forwards the file directly to the host compiler, which significantly reduces the compile time.
- Modified the Makefile to clean cudnn_attn.o as well.
- Simplified the graph cache. Instead of caching all the device pointers, use UIDs as placeholders.
- Removed the workspace size assert in fprop, because fprop SDPA on Hopper requires some workspace (~16B) in cudnn.
The caveat here is that the location of "cuda_runtime_api.h" needs to be added for the Windows build.
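
To illustrate the graph cache simplification, here is a minimal sketch of the idea, not the code in this PR. It assumes the cache is keyed only by the problem shape, and that each tensor slot is named by a fixed UID which is bound to a device pointer at execute time. `Graph`, `lookup_or_build_graph`, `make_variant_pack`, and the UID constants are hypothetical names for illustration, and no cudnn calls are made.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>
#include <unordered_map>

// Hypothetical UIDs, one per tensor slot in the attention graph.
constexpr int64_t Q_UID = 1, K_UID = 2, V_UID = 3, O_UID = 4, STATS_UID = 5;

// Stand-in for a built graph; a real one would hold the execution plan.
struct Graph {
    int64_t workspace_size = 0;
};

// The cache key holds only the shape (B, NH, T, HS) and a training flag.
// No device pointers are part of the key.
using GraphKey = std::tuple<int, int, int, int, bool>;

static std::map<GraphKey, Graph> graph_cache;
static int graphs_built = 0;  // counts the expensive builds, for illustration

Graph& lookup_or_build_graph(int B, int NH, int T, int HS, bool is_training) {
    GraphKey key{B, NH, T, HS, is_training};
    auto it = graph_cache.find(key);
    if (it != graph_cache.end()) {
        return it->second;  // same shape: reuse, whatever the pointers are
    }
    graphs_built++;
    Graph g;  // the expensive graph build would happen here
    return graph_cache.emplace(key, g).first->second;
}

// Maps UID -> device pointer. Built fresh for every call, so buffers can
// move between calls without invalidating the cached graph.
using VariantPack = std::unordered_map<int64_t, void*>;

VariantPack make_variant_pack(void* q, void* k, void* v, void* o, void* stats) {
    assert(q && k && v && o);  // stats may be unused outside training
    return {{Q_UID, q}, {K_UID, k}, {V_UID, v}, {O_UID, o}, {STATS_UID, stats}};
}
```

With this layout, two calls with the same shape but different buffers share one cached graph, and only the small UID-to-pointer map changes per call.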
@Anerudhan - some of your Makefile changes are already here: https://github.com/karpathy/llm.c/pull/357
The windows changes are in there as well. Feel free to comment on any changes (even better if Windows-related) in that PR. Thanks!
I think this was already merged via previous PRs, closing
Hi @karpathy,
Thanks for looking into this. I'm afraid the top of tree still does not have the best possible cudnn compilation time.
I have split this PR into two to reduce complexity, since it tries to solve multiple problems at once.
https://github.com/karpathy/llm.c/pull/386 is the first one. This just targets the compile time improvements.
Once that is merged, I will create another PR to resolve the failures on H100 and simplify the usage of cudnn.
Got it, ok ty!