Results 14 issues of botbw

### 🚀 The feature, motivation and pitch

# Motivation

For complicated `DTensor` redistribution (e.g. `[S(0), S(1)] -> [S(1), S(0)]`), it's likely that only GPU1 and GPU2 need to communicate (when...
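To make the `[S(0), S(1)] -> [S(1), S(0)]` case concrete, here is a minimal pure-Python sketch that works out which ranks end up holding a different block after the redistribution. It assumes a 2x2 device mesh with row-major rank numbering (the preview above is truncated, so the mesh shape is an assumption), and it only models the case where each tensor dim is sharded by exactly one mesh dim. The helper names `block_owned` and `ranks_that_communicate` are hypothetical and are not part of the `DTensor` API.

```python
from itertools import product

# Assumed 2x2 device mesh; rank = i * 2 + j for mesh coordinate (i, j).
MESH_SHAPE = (2, 2)


def block_owned(coord, placements):
    """Return the (row_block, col_block) held at a mesh coordinate.

    placements[k] is the tensor dim sharded along mesh dim k,
    so [S(0), S(1)] is written here as (0, 1).
    """
    block = [0, 0]
    for mesh_dim, tensor_dim in enumerate(placements):
        block[tensor_dim] = coord[mesh_dim]
    return tuple(block)


def ranks_that_communicate(src, dst):
    """List ranks whose local block differs between two placements."""
    moving = []
    for coord in product(*(range(n) for n in MESH_SHAPE)):
        rank = coord[0] * MESH_SHAPE[1] + coord[1]
        if block_owned(coord, src) != block_owned(coord, dst):
            moving.append(rank)
    return moving


# [S(0), S(1)] -> [S(1), S(0)]
print(ranks_that_communicate((0, 1), (1, 0)))  # [1, 2]
```

Under these assumptions ranks 0 and 3 sit on the mesh diagonal and keep the same block, so only ranks 1 and 2 have to swap data.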

Hey, I was running the `sgemm_sm80.cu` example and printing out the tensor layouts. The code triggered a segfault when I added more print statements, and `compute-sanitizer` shows it was due to...

Roadmap:

- [ ] Clear TODOs

# SM8x

- [x] bf16/fp16
- [x] customized metadata layout
- [x] tf32
- [ ] precision issue due to using fp32 as tf32...

`gemm_sp` (v1) is supported on sm80, sm89, and sm90. It leverages the CUTLASS backend and requires the metadata to conform to specific CuTe layouts, which restricts flexibility and...