cutlass
CUDA Templates for Linear Algebra Subroutines
**What is your question?** Hi there, thank you for the work on CUTLASS3.0/CuTe. The "layout algebra" in 3.0 is much more elegant and easier to use than iterators. I guess...
Could anybody explain "layout compatibility"? Examples would be nice.
How can I implement general conv fwd/dgrad/wgrad with CuTe? Could you give examples based on Hopper CuTe?
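One way to read the question above, as a rough sketch rather than an authoritative answer: it assumes the rule from the CuTe layout documentation, where shape A is compatible with shape B if they have the same size and every coordinate valid in A is also valid in B (a layout is then compatible with another when their shapes are). The helpers `size` and `is_compatible` below are hypothetical pure-Python stand-ins, not the CuTe API, and nested shapes are modeled as Python tuples.

```python
def size(shape):
    """Product of all extents in a (possibly nested) shape."""
    if isinstance(shape, tuple):
        n = 1
        for s in shape:
            n *= size(s)
        return n
    return shape


def is_compatible(a, b):
    """Hypothetical check: is shape `a` compatible with shape `b`?

    Assumed rule: a tuple must match a tuple of the same rank, mode by mode;
    an integer is compatible with anything of the same total size.
    """
    if isinstance(a, tuple):
        return (isinstance(b, tuple)
                and len(a) == len(b)
                and all(is_compatible(x, y) for x, y in zip(a, b)))
    return a == size(b)


# A few cases; note that the relation is not symmetric.
examples = [
    (24, (4, 6)),                    # integer vs. rank-2 shape of size 24
    ((4, 6), ((2, 2), 6)),           # refining one mode keeps compatibility
    (((2, 3), 4), ((2, 2), (3, 2))), # mode sizes 6 and 4 vs. 4 and 6
    ((24,), 24),                     # rank-1 tuple vs. plain integer
]
for a, b in examples:
    print(a, "compatible with", b, "->", is_compatible(a, b))
```

Under this assumed rule, refining a mode into sub-modes preserves compatibility, while regrouping extents across modes does not.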
I have gone through the documentation and the available APIs, but I couldn't find explicit information on whether CuTe supports sparse tensor operations or not. Does CuTe currently support sparse...
Hello CUTLASS experts, I'm confused by the implementation of Semaphore. Its "fetch" looks like this: ```C++ if (wait_thread) { #if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 700 asm volatile ("ld.global.acquire.gpu.b32 %0, [%1];\n"...
**Is your feature request related to a problem? Please describe.** As a user of CUTLASS, I would like to build a shared object library, `libA.so`, that internally uses CUTLASS function...
https://github.com/NVIDIA/cutlass/blob/5c447dd84f8ae0e1d48ff9a2eae26ce8c4958101/include/cutlass/gemm/warp/default_mma_tensor_op.h#L121 https://github.com/NVIDIA/cutlass/blob/5c447dd84f8ae0e1d48ff9a2eae26ce8c4958101/include/cutlass/gemm/warp/default_mma_tensor_op_sm80.h#L43 `default_mma_tensor_op.h` includes `default_mma_tensor_op_sm80.h`, while the latter also includes the former. Is this a problem?
### Discussed in https://github.com/NVIDIA/cutlass/discussions/1504 Originally posted by **wzhcz8902** April 28, 2024 https://github.com/NVIDIA/cutlass/blob/5c447dd84f8ae0e1d48ff9a2eae26ce8c4958101/include/cutlass/gemm/warp/mma_tensor_op.h#L140-L168 As a newbie to cutlass, I think this struct is targeted for tensor cores, not cuda cores as...
support Layout::kFactor with 8 loading data from shared memory: |0 | 16 | 32 | 48 | 64 | 80 | 96 | 112| |-- | -- | -- |...
**What is your question?** I am trying to understand how `right_inverse` works in CuTe. For example, https://github.com/NVIDIA/cutlass/blob/main/test/python/pycute/test_right_inverse.py#L88 The given input is `Layout((2,4,6),(4,1,8))`, and I just couldn't figure out why...
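As a possible aid for the question above, here is a minimal pure-Python sketch that does not use pycute. It assumes a right inverse R of a layout L is a layout satisfying L(R(k)) = k, and that a flat layout maps a 1-D index to an offset by unpacking the index colexicographically (leftmost mode fastest). The helper `layout_fn` is hypothetical, and the candidate `(4,2,6):(2,1,8)` is derived by hand here, not taken from the pycute test.

```python
def layout_fn(shape, stride):
    """Return f(i): 1-D index -> offset for a flat shape/stride pair.

    The index is split into per-mode coordinates with the leftmost mode
    varying fastest, then dotted with the strides.
    """
    def f(i):
        offset = 0
        for s, d in zip(shape, stride):
            offset += (i % s) * d
            i //= s
        return offset
    return f


shape, stride = (2, 4, 6), (4, 1, 8)
L = layout_fn(shape, stride)
n = 2 * 4 * 6  # size of the layout

# Brute force: tabulate every offset and invert the table.
offsets = [L(i) for i in range(n)]
index_of = {off: i for i, off in enumerate(offsets)}
R_table = [index_of[k] for k in range(n)]  # R_table[k] = i with L(i) == k

# Hand-derived candidate: strides 4 and 1 occupy the low 8 offsets in
# swapped order, so the inverse swaps the first two modes.
R = layout_fn((4, 2, 6), (2, 1, 8))

print(all(L(R_table[k]) == k for k in range(n)))
print(R_table == [R(k) for k in range(n)])
```

Because this particular layout hits every offset in `[0, 48)` exactly once, the brute-force table is a full inverse; for layouts with gaps or repeated offsets, a right inverse would only cover part of the range.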