Fused bias with matmul using `cublasLtMatmul`
Just want to mention that cuBLAS (via the newer cuBLASLt API) does offer an interface that fuses the matmul with the bias addition: cublasLtMatmul(), which computes D = alpha * (A @ B) + beta * C.
You can pass the bias as C and even get the broadcasting for free by setting the leading dimension of C to 0. The interface is a lot more verbose than cublasSgemm, though.
https://github.com/karpathy/llm.c/blob/a08c11b60ebb1b3300113b808c9770b0ff3a21b4/dev/cuda/matmul_forward.cu#L87-L97
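For reference, a minimal sketch of the idea (not the repo's actual code; variable names, shapes, and error handling are illustrative, and the exact create-call signatures vary slightly between CUDA versions):

```cuda
#include <cublasLt.h>

// Sketch: out = weight @ inp + bias, with the bias broadcast across columns.
// Column-major convention: weight is (OC x C), inp is (C x B), out is (OC x B).
void matmul_forward_bias(cublasLtHandle_t handle,
                         float* out, const float* inp, const float* weight,
                         const float* bias, int OC, int C, int B,
                         void* workspace, size_t workspace_size) {
    float alpha = 1.0f, beta = 1.0f;  // beta = 1 so C (the bias) gets added in

    cublasLtMatmulDesc_t desc;
    cublasLtMatmulDescCreate(&desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);

    cublasLtMatrixLayout_t Alayout, Blayout, Clayout, Dlayout;
    cublasLtMatrixLayoutCreate(&Alayout, CUDA_R_32F, OC, C, OC);
    cublasLtMatrixLayoutCreate(&Blayout, CUDA_R_32F, C, B, C);
    // Leading dimension 0: the single OC-length bias column is
    // broadcast across all B columns of the output.
    cublasLtMatrixLayoutCreate(&Clayout, CUDA_R_32F, OC, B, 0);
    cublasLtMatrixLayoutCreate(&Dlayout, CUDA_R_32F, OC, B, OC);

    // D = alpha * (A @ B) + beta * C, i.e. out = weight @ inp + bias
    cublasLtMatmul(handle, desc,
                   &alpha, weight, Alayout, inp, Blayout,
                   &beta, bias, Clayout, out, Dlayout,
                   NULL /* let cuBLASLt pick an algorithm heuristically */,
                   workspace, workspace_size, 0 /* default stream */);

    cublasLtMatrixLayoutDestroy(Dlayout);
    cublasLtMatrixLayoutDestroy(Clayout);
    cublasLtMatrixLayoutDestroy(Blayout);
    cublasLtMatrixLayoutDestroy(Alayout);
    cublasLtMatmulDescDestroy(desc);
}
```

The verbosity is mostly the layout/descriptor boilerplate; the actual fusion is just the ld=0 trick on the C layout plus beta=1.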
The cuBLASLt API started with CUDA 10.1, which was released Aug 2019. I've been trying to stick to code that, afaik, works with fairly old versions of CUDA/cuBLAS. I think this is probably worth doing if it's faster though, which it should be. Will take a look at using this in /dev/cuda/matmul_forward.cu as an additional kernel that uses this API (or would gladly welcome a PR, too).
This has now been merged, ty @andylolu2 for pointing it out.