ocannl
Study and incorporate Andrej Karpathy's `llm.c` lessons
"A few new CUDA hacker friends joined the effort and now llm.c is only 2X slower than PyTorch"
https://github.com/karpathy/llm.c
https://twitter.com/karpathy/status/1779354343013269929
https://twitter.com/karpathy/status/1781387674978533427 (llm.c achieved parity with PyTorch FP32)
The "study" part is targeted at the 0.6.x versions, but incorporating many of the solutions will wait until 0.9.x.