cutlass
CUDA Templates for Linear Algebra Subroutines
Hi! I have a Batched Matrix Multiply problem with no fixed stride between batches. The minimalist example is the following (all the matrices are RowMajor): I want to calculate $O...
**Describe the bug** Fused GEMM example gives the wrong result for some values of `problemSize1.K`. **Steps/Code to reproduce bug** Set the following problem sizes in `examples/13_two_tensor_op_fusion/fused_two_gemms_f16_sm80_shmem.cu` ```c++ cutlass::gemm::GemmCoord gemm_f16_sm80_problem_size_0(128*640, 48,...
I ran the TF32 GEMM example. Setting different stage counts (1 of 4) gives different accuracy. Why?
**Describe the bug** CUTLASS and EGL header files conflict: if you include an EGL header file (#include ) before including a CUTLASS header file, a compilation error will occur, which can be...
Are `b1 x b1` GEMMs all implemented by XOR, which requires `uint1_t x uint1_t`? What if `A=uint1_t` and `B=int1_t`? (e.g. A is a ReLU output, B is a weight) Thanks...
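For readers unfamiliar with the XOR formulation the question refers to, here is a minimal CPU sketch of a 1-bit dot product built from XOR and population count. It is not CUTLASS code. The function names and the bit encoding (32 elements packed per `uint32_t`; for the mixed case, bits of A read as 0/1 and bits of B read as -1/+1) are assumptions made for illustration only, and the actual semantics of `int1_t` in CUTLASS may differ.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// XOR + popcount over bit-packed vectors (32 one-bit elements per word).
// Returns the number of positions where a and b differ.
int xor_popc_dot(const std::vector<uint32_t>& a, const std::vector<uint32_t>& b) {
    assert(a.size() == b.size());
    int acc = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        acc += static_cast<int>(std::bitset<32>(a[i] ^ b[i]).count());
    }
    return acc;
}

// Hypothetical mixed case: a_i in {0, 1}, b_i in {-1, +1} (bit 1 means +1).
// sum(a_i * b_i) = popc(a & b) - popc(a & ~b) = 2 * popc(a & b) - popc(a).
int mixed_popc_dot(const std::vector<uint32_t>& a, const std::vector<uint32_t>& b) {
    assert(a.size() == b.size());
    int acc = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        int both = static_cast<int>(std::bitset<32>(a[i] & b[i]).count());
        int ones = static_cast<int>(std::bitset<32>(a[i]).count());
        acc += 2 * both - ones;
    }
    return acc;
}
```

Under these assumed encodings, the mixed case needs an AND-style popcount plus a correction term rather than XOR alone, which may be why the operand types matter.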
**Is your feature request related to a problem? Please describe.** When using the -conv-fprop of cutlass to perform the conv operation, it is found that in the entire kernel, the...
Hi! I have written code for slicedK in GEMM, but it seems very slow. I tried to understand CUTLASS's slicedK but could not, so I am posting my code here...
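As a point of reference for the idea behind slicing the K dimension, here is a minimal CPU sketch: the K range is partitioned into slices, each slice accumulates its own partial output tile, and the partials are summed at the end. This is not the poster's code and not CUTLASS's implementation. The function name `gemm_sliced_k` and the row-major layout are assumptions for illustration, and a GPU kernel would assign slices to warps and reduce through shared memory instead of looping serially.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// C (M x N) = A (M x K) * B (K x N), all row-major, with K split into `slices` chunks.
std::vector<float> gemm_sliced_k(const std::vector<float>& A, const std::vector<float>& B,
                                 int M, int N, int K, int slices) {
    assert(slices > 0);
    assert(A.size() == static_cast<std::size_t>(M) * K);
    assert(B.size() == static_cast<std::size_t>(K) * N);

    const std::size_t tile = static_cast<std::size_t>(M) * N;
    // One partial accumulator tile per K slice.
    std::vector<std::vector<float>> partial(slices, std::vector<float>(tile, 0.0f));
    const int chunk = (K + slices - 1) / slices;

    for (int s = 0; s < slices; ++s) {
        const int k_begin = s * chunk;
        const int k_end = std::min(K, k_begin + chunk);
        for (int m = 0; m < M; ++m)
            for (int n = 0; n < N; ++n)
                for (int k = k_begin; k < k_end; ++k)
                    partial[s][m * N + n] += A[m * K + k] * B[k * N + n];
    }

    // Final reduction across slices.
    std::vector<float> C(tile, 0.0f);
    for (int s = 0; s < slices; ++s)
        for (std::size_t i = 0; i < tile; ++i)
            C[i] += partial[s][i];
    return C;
}
```

The extra reduction step is the cost of this scheme, so it tends to pay off only when K is large relative to M and N.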
Fixed bugs and updated the verification logic. * Removed verification for `Max`, making the verification logic more consistent: we don't check `Sum`, so we won't check `Max`. * Fixed the correctness...
Add residual support for the shmem staging iterator used in back-to-back GEMM fusion. This allows support of problem_size_0_n that is not a multiple of 32. @danthe3rd, would you please give it...
I want to implement a BN layer as an epilogue with CUTLASS, which requires both division and addition operations. Is there a way to implement something like...
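One common way to approach this, sketched here under the assumption that the BN statistics are fixed (inference mode), is to fold the division into a per-channel scale ahead of time so that the epilogue only needs a multiply and an add: `y = gamma * (x - mean) / sqrt(var + eps) + beta = scale * x + shift`. The struct and function names below are hypothetical and are not part of the CUTLASS API.

```cpp
#include <cassert>
#include <cmath>

// Per-channel batch-norm parameters (inference mode, statistics fixed).
struct BnParams { float gamma, beta, mean, var, eps; };

// Folded form: y = scale * x + shift.
struct ScaleShift { float scale, shift; };

// Precompute scale and shift once per channel, outside the kernel.
ScaleShift fold_batchnorm(const BnParams& p) {
    assert(p.var + p.eps > 0.0f);
    const float scale = p.gamma / std::sqrt(p.var + p.eps);
    return {scale, p.beta - p.mean * scale};
}

// What the epilogue would then compute for each accumulator value.
float apply_epilogue(float acc, const ScaleShift& s) {
    return acc * s.scale + s.shift;
}
```

With the fold done on the host, the per-element work has the same shape as a linear-combination epilogue with a per-channel scale and bias vector.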