cutlass
CUDA Templates for Linear Algebra Subroutines
Clang built from source: https://clang.llvm.org/get_started.html
```
../llvm-project/build/bin/clang -v
clang version 18.0.0git (https://github.com/llvm/llvm-project.git a855b2c894444419c3689aff6fd0381fdeb02491)
```
main.cpp
```
#include <iostream>
#include "cutlass/epilogue/collective/collective_builder.hpp"

int main() {
  cutlass::half_t x = 2.25_hf;
  std::cout...
```
`auto gA = local_tile(mA, blk_shape, blk_coord, Step<_1, X, _1>{}); // (BLK_M,BLK_K,k)` I am studying this line from the example code: https://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/sgemm_nt_1.cu How do we get this? By the way, I printed it out, size...
``` auto tC = make_layout(make_shape(Int{}, Int{})); auto tCsA = local_partition(sA, tC, threadIdx.x, Step{}); ``` But I get (_8,_8) as tCsA's shape, why??? I am learning code: https://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/sgemm_nt_1.cu
**What is your question?** Hello, thanks for your project. cutlass version: 2.10 device RTX 3090 I want to implement a W4A4 conv quantization in tensorrt_llm by cutlass. Follow the example...
**What is your question?** ``` Array access Users access a Tensor's elements in one of three ways: operator(), taking as many integral arguments as the number of modes, corresponding to...
**What is your question?** Hi! I see the swizzle.hpp file, but I am not sure how to use it. For the sgemm_nt.cu code you provided, for example, could you show me how to...
**Describe the bug** Using DefaultCopy on A100 implicitly generates the unexpected LDGSTS. Users are not aware of the need to commit and wait. **Steps/Code to reproduce bug** ``` using GmemTiledCopy...
I think [cpp11.cu](https://github.com/NVIDIA/cutlass/blob/6e60b9b17c5e6734488dbb7401b5c55ccb37feba/test/unit/core/cpp11.cu#L76) should be comparing against (from https://gcc.gnu.org/onlinedocs/cpp/Standard-Predefined-Macros.html) `201103L`. Although I vaguely remember that with a newer compiler, it can be difficult to test old standard compatibility. So maybe...
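As a rough illustration of the comparison this issue proposes, here is a minimal sketch of a C++11 check against `201103L`. The helper `is_cpp11_or_later` and the constant `kCompiledAsCpp11OrLater` are hypothetical names used only for this example; they are not taken from CUTLASS or from `cpp11.cu`.

```cpp
// Hypothetical helper (not part of CUTLASS): returns true when a
// __cplusplus value denotes C++11 or newer. 201103L is the value the
// C++11 standard specifies for the __cplusplus macro.
constexpr bool is_cpp11_or_later(long cplusplus_value) {
  return cplusplus_value >= 201103L;
}

// Published __cplusplus values, checked at compile time.
static_assert(!is_cpp11_or_later(199711L), "C++98/03 predates C++11");
static_assert(is_cpp11_or_later(201103L), "C++11 itself must pass");
static_assert(is_cpp11_or_later(201402L), "C++14 must pass");
static_assert(is_cpp11_or_later(201703L), "C++17 must pass");

// True if this translation unit is being compiled as C++11 or newer.
constexpr bool kCompiledAsCpp11OrLater = is_cpp11_or_later(__cplusplus);
```

One caveat when relying on this macro: MSVC reports `__cplusplus` as `199711L` unless `/Zc:__cplusplus` is passed, so a check like this may need a compiler-specific fallback there.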
**What is your question?** Hi, thanks for the great work! Recently, I have been exploring the performance improvements from all of the optimizations in CUTLASS. I want to profile all of...
**What is your question?** I am trying to use `cutlass::conv::device::Convolution` with a fixed ThreadblockShape, WarpShape and InstructionShape. It fails with an internal error, which is actually "too many resources requested". It may...