cutlass
CUDA Templates for Linear Algebra Subroutines
**What is your question?** May I ask whether the CUTLASS epilogue supports customization? I would like to perform a bias addition after the matmul operation in CUTLASS....
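For illustration, one common way to express a bias add in a GEMM of the form `D = alpha * A * B + beta * C` is to treat the bias vector as a `C` operand whose row stride is zero, so every output row reads the same `N` values. The host-side sketch below shows only that arithmetic. `gemm_bias_reference` and its parameters are hypothetical names, not CUTLASS API, and whether a particular CUTLASS epilogue accepts a zero stride is not something the question above confirms.

```cpp
#include <cstddef>
#include <vector>

// Host reference for D = alpha * (A * B) + beta * C with row-major storage.
// A is M x K, B is K x N, D is M x N. C is addressed as C[i * ldc + j];
// with ldc == 0 every row of C aliases the same N values, which makes C
// behave like a per-column bias vector of length N.
// (Hypothetical helper for illustration, not part of CUTLASS.)
std::vector<float> gemm_bias_reference(const std::vector<float>& A,
                                       const std::vector<float>& B,
                                       const std::vector<float>& C,
                                       std::size_t ldc,
                                       std::size_t M, std::size_t N, std::size_t K,
                                       float alpha, float beta) {
  std::vector<float> D(M * N);
  for (std::size_t i = 0; i < M; ++i) {
    for (std::size_t j = 0; j < N; ++j) {
      float acc = 0.0f;
      for (std::size_t k = 0; k < K; ++k) {
        acc += A[i * K + k] * B[k * N + j];
      }
      // Epilogue step: scale the accumulator and add the (broadcast) C term.
      D[i * N + j] = alpha * acc + beta * C[i * ldc + j];
    }
  }
  return D;
}
```

A reference like this can also be used to check the output of a device kernel once a fused bias epilogue is in place.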
Hi, @thakkarV https://github.com/NVIDIA/cutlass/blob/47a3ebbea9860e14c095b52c4e6e2db33340f572/include/cutlass/epilogue/collective/sm70_epilogue_vectorized.hpp#L237 Strangely, it requires TiledCopyS2R's threads to equal the MMA AtomC's threads. I think this is where we describe how each thread does LDS, and therefore it should be: ```c++...
**What is your question?** Hello! I want to implement elementwise epilogue, which depends on output matrix coordinates, i.e. ``` d_ij = F(alpha * sum_k(a_ik * b_kj) + c_ij, i, j)...
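As a point of reference, the formula in the question can be written as a plain host-side loop in which the epilogue functor receives the output coordinates `(i, j)` along with the value. This is a minimal sketch for checking results, not a CUTLASS epilogue. `gemm_with_coord_epilogue` and `LowerTriangularMask` are hypothetical names introduced here.

```cpp
#include <cstddef>
#include <vector>

// Host reference for d_ij = F(alpha * sum_k(a_ik * b_kj) + c_ij, i, j).
// Row-major storage: A is M x K, B is K x N, C and D are M x N.
// (Hypothetical helper for illustration, not part of CUTLASS.)
template <class Epilogue>
std::vector<float> gemm_with_coord_epilogue(const std::vector<float>& A,
                                            const std::vector<float>& B,
                                            const std::vector<float>& C,
                                            std::size_t M, std::size_t N, std::size_t K,
                                            float alpha, Epilogue F) {
  std::vector<float> D(M * N);
  for (std::size_t i = 0; i < M; ++i) {
    for (std::size_t j = 0; j < N; ++j) {
      float acc = 0.0f;
      for (std::size_t k = 0; k < K; ++k) {
        acc += A[i * K + k] * B[k * N + j];
      }
      // The functor sees both the value and its output coordinates.
      D[i * N + j] = F(alpha * acc + C[i * N + j], i, j);
    }
  }
  return D;
}

// Example coordinate-dependent epilogue (an assumption chosen for the demo):
// keep the lower triangle including the diagonal, zero everything above it.
struct LowerTriangularMask {
  float operator()(float x, std::size_t i, std::size_t j) const {
    return j > i ? 0.0f : x;
  }
};
```

On the device, the same idea requires each thread to know the global coordinates of the accumulator elements it owns, which is the part the question is asking about.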
I see this in the example code: https://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/sgemm_nt_1.cu So I wonder, are there any other legal layouts? // Define block sizes (static) auto bM = Int{}; auto bN = Int{}; auto...
Thank you for your great work, for both cutlass and cute. I'm following instructions to build my program. I use make_tensor to build rav as a pointer to specific register...
Can someone walk through how `cosize` and `complement` are derived, using a few simple examples? For example, when I run the `complement.cpp` `CuTe` unit test with `CUTLASS_TRACE_DEBUG_LEVEL=1`, the following is printed,...
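A small host-side model can make the two operations concrete. The sketch below handles only flat layouts (one shape and stride per mode) with nonzero strides, and assumes each stride, once sorted, is divisible by the extent covered so far. It follows the usual description of these operations (`cosize` is the largest produced index plus one; `complement` fills the gaps the layout leaves up to a cotarget `M`). It is not the CuTe implementation, and `Mode`, `Layout`, and both functions are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// A flat layout: one {shape, stride} pair per mode, e.g. 4:2 is {{4, 2}}.
using Mode = std::pair<std::size_t, std::size_t>;
using Layout = std::vector<Mode>;

// cosize = (largest index the layout can produce) + 1.
std::size_t cosize(const Layout& layout) {
  std::size_t last = 0;
  for (const Mode& m : layout) {
    last += (m.first - 1) * m.second;
  }
  return last + 1;
}

// Complement with respect to a cotarget M (simplified model, see assumptions
// in the lead-in): walk the modes in order of increasing stride and emit a
// mode for every gap between what is already covered and the next stride.
Layout complement(Layout layout, std::size_t M) {
  std::sort(layout.begin(), layout.end(),
            [](const Mode& a, const Mode& b) { return a.second < b.second; });
  Layout result;
  std::size_t current = 1;  // smallest stride not yet covered
  for (const Mode& m : layout) {
    if (m.first == 1) continue;  // size-1 modes contribute nothing
    std::size_t gap = m.second / current;
    if (gap > 1) result.push_back({gap, current});
    current = m.first * m.second;
  }
  std::size_t rest = (M + current - 1) / current;  // ceil(M / current)
  if (rest > 1) result.push_back({rest, current});
  if (result.empty()) result.push_back({1, 0});  // trivial layout 1:0
  return result;
}
```

For example, under this model the layout `4:2` has cosize `3 * 2 + 1 = 7`, and its complement with respect to 24 is `(2,3):(1,8)`: a stride-1 mode of size 2 fills the holes between consecutive elements, and a stride-8 mode of size 3 repeats the covered block of 8 up to 24.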
https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/tfloat32.h#L80 Why does the code implement "round toward nearest even" rather than "round to nearest, ties away from zero"?
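To make the difference between the two rounding modes concrete, the sketch below applies each of them to the raw bits of a float32 when reducing the 23-bit mantissa to the 10 bits TF32 keeps, so the low 13 bits are dropped. It makes no claim about what `tfloat32.h` does at the linked line. The function names are hypothetical, and NaN and infinity handling is deliberately left out.

```cpp
#include <cstdint>

// TF32 keeps 10 of float32's 23 mantissa bits, so 13 low bits are dropped.
constexpr std::uint32_t kDroppedBits = 13;
constexpr std::uint32_t kDropMask = (1u << kDroppedBits) - 1u;  // 0x1FFF

// Round to nearest, ties away from zero: add half an ulp of the reduced
// format to the magnitude bits, then truncate. An exact tie always carries
// into the kept bits. (No NaN/Inf handling; illustration only.)
std::uint32_t round_ties_away(std::uint32_t bits) {
  return (bits + (1u << (kDroppedBits - 1))) & ~kDropMask;
}

// Round to nearest, ties to even: add just under half an ulp, plus one more
// if the lowest kept bit is odd. An exact tie then carries only when that
// makes the kept mantissa even. (No NaN/Inf handling; illustration only.)
std::uint32_t round_nearest_even(std::uint32_t bits) {
  std::uint32_t lsb = (bits >> kDroppedBits) & 1u;
  return (bits + (kDropMask >> 1) + lsb) & ~kDropMask;
}
```

The two functions differ only on exact ties whose lowest kept bit is zero. For the bit pattern `0x3F801000`, ties-to-even yields `0x3F800000` while ties-away yields `0x3F802000`.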
I can use nvcc in my environment, but when I run `cmake .. -DCUTLASS_NVCC_ARCHS=80`, I get an error: -- CMake Version: 3.18.2 CMake Error at /usr/share/cmake/Modules/CMakeDetermineCUDACompiler.cmake:25 (message): Could not find compiler...
**Describe the bug** **Steps/Code to reproduce bug** ```cuda #include "cute/tensor.hpp" using namespace cute; __global__ void kernel() { constexpr auto weird = right_inverse(make_layout(_2{}, _1{})); print(weird); } int main() { kernel(); cudaDeviceSynchronize();...
In the file include\cutlass\gemm\warp\mma_tensor_op_tile_iterator.h, the comments on the iterator's store interface say that it "stores a tile with a logical offset in units of whole tiles". However, TensorRef is actually...