Andrej
@ent0n29 does this make the code slower for you? I wonder if I should merge to master and potentially get rid of the README comment
Sorry @ent0n29, what is the final recommendation here right now? Is it `CFLAGS = -Ofast -fno-finite-math-only -Wno-unused-result -march=native`?
yeah, agree i think. there are two instances of this btw
i'll come around to this eventually
C is just the simplest thing. Super happy to link to any Mojo port (or anything else) from the main readme! We can benchmark and compare them all :)
GPT-2 is not far away from these SOTA models at all. The most complex new layer that is needed is probably RoPE, and even that is not too complex. My...
Oh one thing I'll say is that this being fp32, you probably don't want to try to work with anything much larger than ~1B. The smallest Llama 2 sadly is...
The [cuBLASLt API](https://docs.nvidia.com/cuda/cublas/#using-the-cublaslt-api) started with CUDA 10.1, which was released Aug 2019. I've been trying to use code that, afaik, can work with fairly old versions of CUDA/cuBLAS. I think...
This is now merged, ty @andylolu2 for pointing this out too.