Haicheng Wu
BTW, if you want to run batched GEMM for a transformer model, Grouped GEMM may be more useful to you. Check https://github.com/NVIDIA/cutlass/tree/master/examples/24_gemm_grouped . Grouped GEMM is not runnable in the...
36 is not a multiple of 8. The kernel instantiated in the example requires M to be a multiple of 8. You can change the alignment to run M = 36. Or you can...
- alignmentA is set at https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/kernel/default_gemm_grouped.h#L73
- alignmentB is set at https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/kernel/default_gemm_grouped.h#L81
- alignmentC is set at https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/epilogue/thread/linear_combination.h#L58

Both A and C need to be changed to 4 if M is 36 and the layouts are col...
Back to your original batched GEMM profiling problem: you can use this CMake command

```
cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=sgemm
```

to build and run just the FP32 GEMM kernels on A100.
We actually tried `CUDA_VERSION`, but it caused issues for us. We do not recommend mixing different versions of CUDA and nvcc, especially when the major version numbers differ. They are...
I honestly don't remember. We need to support the big matrix of different versions of {nvcc, nvrtc, clang, msvc, nvc++} x different GPUs. One or a few of the combinations have...
The same as cuBLAS, your case is not supported by CUTLASS without code changes. However, you can use a semaphore to control the order of stores to global memory. This...
> Am I right, that the only reason I get wrong answer are the threadblocks using the memory that should have been locked? If I introduce a proper timing between...
Congratulations! If possible, maybe you can use `blockIdx.z` to decide the write order. When you finish your code, you can first make your repository public. We can help guide people...
not in 2.6 :smile: