Fused bias with matmul using `cublasLtMatmul`
Just want to mention that cuBLAS (via the newer cuBLASLt API) does offer an interface that fuses the matmul with the bias addition: cublasLtMatmul(), which computes D = alpha * (A @ B) + beta * C.
You can pass the bias as C and even get the broadcasting for free by setting the leading dimension of C to 0. The interface is a lot more verbose than cublasSgemm, though.
https://github.com/karpathy/llm.c/blob/a08c11b60ebb1b3300113b808c9770b0ff3a21b4/dev/cuda/matmul_forward.cu#L87-L97
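For reference, a minimal sketch of the idea (not the repo's actual code; variable names, shapes, and error handling are illustrative, and the exact create-call signatures vary slightly between CUDA versions):

```cuda
#include <cublasLt.h>

// Sketch: out = weight @ inp + bias, with the bias broadcast across columns.
// Column-major convention: weight is (OC x C), inp is (C x B), out is (OC x B).
void matmul_forward_bias(cublasLtHandle_t handle,
                         float* out, const float* inp, const float* weight,
                         const float* bias, int OC, int C, int B,
                         void* workspace, size_t workspace_size) {
    float alpha = 1.0f, beta = 1.0f;  // beta = 1 so C (the bias) gets added in

    cublasLtMatmulDesc_t desc;
    cublasLtMatmulDescCreate(&desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);

    cublasLtMatrixLayout_t Alayout, Blayout, Clayout, Dlayout;
    cublasLtMatrixLayoutCreate(&Alayout, CUDA_R_32F, OC, C, OC);
    cublasLtMatrixLayoutCreate(&Blayout, CUDA_R_32F, C, B, C);
    // Leading dimension 0: the single OC-length bias column is
    // broadcast across all B columns of the output.
    cublasLtMatrixLayoutCreate(&Clayout, CUDA_R_32F, OC, B, 0);
    cublasLtMatrixLayoutCreate(&Dlayout, CUDA_R_32F, OC, B, OC);

    // D = alpha * (A @ B) + beta * C, i.e. out = weight @ inp + bias
    cublasLtMatmul(handle, desc,
                   &alpha, weight, Alayout, inp, Blayout,
                   &beta, bias, Clayout, out, Dlayout,
                   NULL /* let cuBLASLt pick an algorithm heuristically */,
                   workspace, workspace_size, 0 /* default stream */);

    cublasLtMatrixLayoutDestroy(Dlayout);
    cublasLtMatrixLayoutDestroy(Clayout);
    cublasLtMatrixLayoutDestroy(Blayout);
    cublasLtMatrixLayoutDestroy(Alayout);
    cublasLtMatmulDescDestroy(desc);
}
```

The verbosity is mostly the layout/descriptor boilerplate; the actual fusion is just the ld=0 trick on the C layout plus beta=1.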
The cuBLASLt API started with CUDA 10.1, which was released Aug 2019. I've been trying to stick to code that, afaik, works with fairly old versions of CUDA/cuBLAS. I think this is probably worth doing if it's faster though, which it should be. Will take a look at using this in /dev/cuda/matmul_forward.cu as an additional kernel that uses this API (or would gladly welcome a PR, too).
This has now been merged, ty @andylolu2 for pointing it out.