llm.c cudnn compile-time improvement

Compilation time is down from 1m to 31s (on my local machine). This is achieved by renaming cudnn_attn.cu to cudnn_attn.cpp: nvcc looks at the file extension and forwards a .cpp file directly to the host compiler, which significantly reduces the compile time.
Modified the Makefile so that clean removes cudnn_attn.o as well.
Simplified the graph cache. Instead of caching all the device pointers, the graph uses UIDs as placeholders.
Removed the workspace size assert in fprop, because fprop SDPA on Hopper requires some workspace (~16B) in cuDNN.
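
To illustrate the graph cache change, here is a minimal sketch of the idea, not the code in this PR. The cache is keyed only by the problem shape, and tensors are named by fixed UIDs, so device pointers are supplied per call instead of being stored in the cache. All names here (`AttentionGraph`, `TensorUid`, `lookup_or_build_graph`, `make_forward_pack`, `g_builds`) are hypothetical, and `AttentionGraph` is a plain struct standing in for a built cuDNN graph.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>
#include <unordered_map>

// Hypothetical UIDs that name each tensor in the attention graph.
enum TensorUid : int64_t { Q_UID = 1, K_UID, V_UID, O_UID, STATS_UID };

// Stand-in for a built cuDNN graph (assumption: the real object would hold
// the compiled plan; here it only records the shape it was built for).
struct AttentionGraph {
    int B, H, T, HS;
    bool is_inference;
};

// The cache key contains only shape information, never device pointers.
using CacheKey = std::tuple<int, int, int, int, bool>;

// Counts how many times a graph was actually built (for illustration).
static int g_builds = 0;

AttentionGraph& lookup_or_build_graph(int B, int H, int T, int HS, bool is_inference) {
    static std::map<CacheKey, AttentionGraph> cache;
    CacheKey key{B, H, T, HS, is_inference};
    auto it = cache.find(key);
    if (it == cache.end()) {
        // Cache miss: build once (the expensive step in the real code).
        ++g_builds;
        it = cache.emplace(key, AttentionGraph{B, H, T, HS, is_inference}).first;
    }
    return it->second;
}

// Maps UID -> device pointer; rebuilt cheaply on every call.
using VariantPack = std::unordered_map<int64_t, void*>;

VariantPack make_forward_pack(void* q, void* k, void* v, void* o,
                              void* stats, bool is_inference) {
    // The softmax stats tensor is only needed when training.
    assert(is_inference || stats != nullptr);
    VariantPack pack{{Q_UID, q}, {K_UID, k}, {V_UID, v}, {O_UID, o}};
    if (!is_inference) {
        pack[STATS_UID] = stats;
    }
    return pack;
}
```

With this layout, two calls with the same shape but different buffers reuse one cached graph, and only the small UID-to-pointer map changes between calls.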

The caveat here is that the location of "cuda_runtime_api.h" still needs to be added for the Windows build.

May 06 '24 03:05 Anerudhan

@Anerudhan - some of your Makefile changes are already here: https://github.com/karpathy/llm.c/pull/357

The windows changes are in there as well. Feel free to comment on any changes (even better if Windows-related) in that PR. Thanks!

May 06 '24 05:05 rosslwheeler

I think this was already merged via previous PRs, closing

May 08 '24 18:05 karpathy

Hi @karpathy,

Thanks for looking into this. I'm afraid the top of tree still does not have the best possible cuDNN compilation time.

I have split this into two PRs to reduce complexity, since the original tried to solve multiple problems at once.

https://github.com/karpathy/llm.c/pull/386 is the first one. This just targets the compile time improvements.

Once that is merged, I will create another PR to resolve the failures on H100 and simplify the usage of cuDNN.

May 08 '24 19:05 Anerudhan

Got it, ok ty!

May 08 '24 20:05 karpathy