cuBLASDx samples feedback
Hello, I wanted to see if cuBLASDx was a feasible replacement for something I thought I needed to do with CUTLASS/CuTe. Here was my workflow:
- Oh they have a fused kernel example!
- Oh, it's only launching one block?
- Oh, they have a multiblock example (and only one besides attn and fft), but it seems slow. Let's try boosting up K.
- A static assertion for block size kicks in. I read a comment saying that it doesn't split on K. I change 'K' to 1024 and specify 'k' as 64.
- A cooperative copy assertion kicks in that I don't understand.
I'm curious about the choice to launch almost all of the examples with grid_size=1. If the kernels were mostly complete, it would be fairly easy to fiddle with them and figure out how to do something specific to your needs. It also seems like it would be good to have at least one canonical example that implements a simple GEMM with arbitrary K (maybe with command-line params to get a feel for the speed?).
FWIW I'm trying to do an outer product of two matrices and then issue an FMA on the result, in place, against another matrix in VRAM. A fused kernel for this would save an unbelievable amount of VRAM traffic for certain problem sizes.
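To make the operation concrete, here is a minimal CPU reference sketch in plain C++. It assumes that "outer product" means a rank-K update `C += A * B`, with `A` of size M x K, `B` of size K x N, and `C` of size M x N, all row-major. The function name and the tile parameters are hypothetical, and nothing here is a cuBLASDx API. The two outer loops stand in for a multi-block grid, and the loop over `bk` walks an arbitrary K in fixed-size chunks.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (not a cuBLASDx API): C += A * B, row-major.
// A is M x K, B is K x N, C is M x N and is updated in place.
void fused_update_reference(const std::vector<float>& A,
                            const std::vector<float>& B,
                            std::vector<float>& C,
                            std::size_t M, std::size_t N, std::size_t K,
                            std::size_t tile_m = 32,
                            std::size_t tile_n = 32,
                            std::size_t tile_k = 64) {
    // Each (bm, bn) pair plays the role of one thread block in the grid.
    for (std::size_t bm = 0; bm < M; bm += tile_m) {
        for (std::size_t bn = 0; bn < N; bn += tile_n) {
            const std::size_t m_end = std::min(bm + tile_m, M);
            const std::size_t n_end = std::min(bn + tile_n, N);
            // Walk the full K in chunks of tile_k, so K can be arbitrary.
            for (std::size_t bk = 0; bk < K; bk += tile_k) {
                const std::size_t k_end = std::min(bk + tile_k, K);
                for (std::size_t i = bm; i < m_end; ++i) {
                    for (std::size_t j = bn; j < n_end; ++j) {
                        // Accumulate straight onto the existing C value,
                        // so the product is never stored separately.
                        float acc = C[i * N + j];
                        for (std::size_t k = bk; k < k_end; ++k) {
                            acc = std::fma(A[i * K + k], B[k * N + j], acc);
                        }
                        C[i * N + j] = acc;
                    }
                }
            }
        }
    }
}
```

The intermediate product is never written out on its own: each element of `C` is read once per K chunk, accumulated, and written back. That is the memory-traffic saving a fused GPU kernel would be after.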
cuBLASDx 0.3.0 addresses this. See here and see the sample code in example/cublasdx from the official release.
We're working on pushing updated samples that will work with all cuBLASDx 0.3.0 features.
There was a small delay with the samples. Now we have the latest version up on GitHub as well (not just in the download). @capybara-club were you able to use the latest cuBLASDx and see improvements?
@llukas They look better! However, it would be really great if the fused matrix multiply examples were set up for multiple blocks. I don't really understand why any of the examples would launch with only 1 block. If I search for "<<<1," to find the samples that launch 1 block, most of the examples still show up.