llm.c
CUDA code that approaches cuBLAS performance
https://colab.research.google.com/drive/1RNFSPtD0o9aJFwnqKQSRabODtSZjwPN1 by https://makslevental.github.io/, which is based on https://siboehm.com/articles/22/CUDA-MMM, seems quite fast. I'm also looking at this: https://thunder.snu.ac.kr/?page_id=64&page=6. I'm just fishing for opinions here; my plan is to emulate that blog/website and start by implementing matmul_forward for this repo (rough sketch below). If anyone else wants to use these as references, please go ahead.
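
To make the idea concrete, here is a minimal sketch of the first optimization step from the siboehm post (shared-memory tiling) applied to the forward matmul. The kernel name matmul_forward_tiled is mine, and the shapes are my reading of the CPU reference in this repo, so treat them as assumptions: inp is (B*T, C), weight is (OC, C), bias is (OC), out is (B*T, OC), and out = inp @ weight^T + bias. This is nowhere near cuBLAS yet, just a starting point.

```cuda
#include <cuda_runtime.h>

#define TILE 16

// Shared-memory tiled sketch of the forward matmul:
// out[i, o] = bias[o] + sum_k inp[i, k] * weight[o, k]
// where i indexes B*T rows, o indexes OC output channels, k indexes C.
__global__ void matmul_forward_tiled(float* out, const float* inp,
                                     const float* weight, const float* bias,
                                     int BT, int C, int OC) {
    __shared__ float inp_s[TILE][TILE];  // tile of inp:    (rows of out) x (k slice)
    __shared__ float w_s[TILE][TILE];    // tile of weight: (cols of out) x (k slice)

    int row = blockIdx.y * TILE + threadIdx.y;  // index into B*T
    int col = blockIdx.x * TILE + threadIdx.x;  // index into OC

    float acc = 0.0f;
    for (int t = 0; t < (C + TILE - 1) / TILE; t++) {
        int k = t * TILE + threadIdx.x;
        int wRow = blockIdx.x * TILE + threadIdx.y;  // output channel for this weight row
        // each thread stages one element of the inp tile and one of the weight tile,
        // zero-padding out-of-bounds elements so partial tiles stay correct
        inp_s[threadIdx.y][threadIdx.x] = (row < BT && k < C) ? inp[row * C + k] : 0.0f;
        w_s[threadIdx.y][threadIdx.x] = (wRow < OC && k < C) ? weight[wRow * C + k] : 0.0f;
        __syncthreads();

        // partial dot product over this K-tile
        for (int kk = 0; kk < TILE; kk++) {
            acc += inp_s[threadIdx.y][kk] * w_s[threadIdx.x][kk];
        }
        __syncthreads();
    }

    if (row < BT && col < OC) {
        out[row * OC + col] = acc + (bias ? bias[col] : 0.0f);
    }
}

// Launch sketch (B, T, C, OC named as in the repo; BT = B * T):
// dim3 block(TILE, TILE);
// dim3 grid((OC + TILE - 1) / TILE, (BT + TILE - 1) / TILE);
// matmul_forward_tiled<<<grid, block>>>(d_out, d_inp, d_weight, d_bias, B * T, C, OC);
```

From there the siboehm progression would be the same as in the article: global-memory coalescing, register/warp tiling with each thread computing several outputs, and vectorized loads, which is where most of the gap to cuBLAS closes.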