Jack Kosaian
Yes. For example, you can modify the example [14_ampere_tf32_tensorop_gemm](https://github.com/NVIDIA/cutlass/blob/master/examples/14_ampere_tf32_tensorop_gemm/ampere_tf32_tensorop_gemm.cu) to use double precision. To do so, you can change ElementAccumulator, ElementInputA, and ElementInputB on [these lines](https://github.com/NVIDIA/cutlass/blob/master/examples/14_ampere_tf32_tensorop_gemm/ampere_tf32_tensorop_gemm.cu#L170) to be of...
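A minimal sketch of what those type changes might look like is below. `ElementOutput` is included only as an assumption (the note above mentions only the accumulator and input types), and the tile/instruction shapes elsewhere in the example would also need to be ones that SM80 supports for FP64 tensor-op GEMM.

```cpp
// Hypothetical sketch: the element-type aliases of the example changed to double.
// The tile and instruction shapes used later in the example must also be
// changed to configurations that SM80 supports for FP64 tensor-op GEMM
// (not shown here).
using ElementAccumulator = double;   // accumulator type
using ElementInputA      = double;   // element type of operand A
using ElementInputB      = double;   // element type of operand B
using ElementOutput      = double;   // element type of the output matrix (assumption)
```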
Yes, you're correct. This is being fixed in https://github.com/NVIDIA/cutlass/pull/1451
The CUTLASS Python interface does support s8 GEMMs. Unit tests that show examples of using these are [here](https://github.com/NVIDIA/cutlass/blob/main/test/python/cutlass/gemm/gemm_s8_sm80.py) and [here](https://github.com/NVIDIA/cutlass/blob/main/test/python/cutlass/gemm/gemm_s8_sm90.py). The CUTLASS Python interface does not currently support s4. You...
Thanks! I thought this was being tracked in our CI, but it turns out that the unit tests related to PyTorch extension emission all involved emitting CUTLASS 2.x kernels. I'm...
`sk_regions` indicates the number of sub-partitions of the `sk_tiles` that will be covered by groups of stream-K blocks. You can see that, by default, this value is 1: all stream-K...
A cohort is a way of structuring the assignment of output tiles to CTAs that aims to achieve high L2 cache reuse. It attempts to mirror the concept...
Part of this is just what you mentioned: a missing mapping for sqrt. We would first need to add `sqrt` to `include/cutlass/functional.h`. However, we'd also need to add the mapping...
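For illustration, here is a hedged sketch of what such a functor might look like, following the style of the existing element-wise functors in `include/cutlass/functional.h`. The name `square_root` and the float-based implementation are assumptions; real support would likely also want specializations for CUTLASS numeric types and `Array<T, N>`.

```cpp
// Hypothetical sketch of a sqrt functor in the style of the element-wise
// functors in include/cutlass/functional.h; name and details are assumptions.
#include <math.h>

#include "cutlass/cutlass.h"

namespace cutlass {

template <typename T>
struct square_root {
  CUTLASS_HOST_DEVICE
  T operator()(T const &value) const {
    // Compute in float and convert back to T; specializations for double,
    // half_t, etc. would likely be added for real use.
    return T(::sqrtf(static_cast<float>(value)));
  }
};

} // namespace cutlass
```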
We haven't yet done the plumbing to emit the correct EVT arguments structures for creating a PyTorch extension for a kernel that uses EVT. Apologies that this hasn't been better...
@mhoemmen, can you take a look?
cc @apuaaChen for thoughts on how to do this with CUTLASS 3.x SM80 EVT (likely would need some added ops)