Manish Gupta comments

Results: 30 comments of Manish Gupta

add Conv singlestage

We do have a [singlestage mma](https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/threadblock/mma_singlestage.h) pipeline for GEMMs. Are there use cases on T4 where a single-stage pipeline wins over the two-stage pipeline?

Add support for TensorCore F32 emulation through TF32

self tag @manishucsd

Add support for TensorCore F32 emulation through TF32

OptionalAttr: Present OR Not Present. Follow `bypassL1`, which uses the same datatype.

- Present: Allowed to use TF32 lowering given that the data type for the F32 (update the verifier)...

Add support for TensorCore F32 emulation through TF32

In progress here: https://reviews.llvm.org/D130294

Add support for TensorCore F32 emulation through TF32

The next steps here are to use the added OptionalAttr `tf32Enabled` and enum `MmaSyncF32Lowering` to enable support for `TF32x3` a.k.a. F32 emulation through TensorCores.
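To make the `TF32x3` idea concrete, here is a minimal scalar sketch in Python, not the actual MLIR lowering. It assumes TF32 can be modeled by truncating a binary32 value to 10 explicit mantissa bits (real hardware conversion may round instead), and it uses Python doubles in place of the F32 accumulator. The helper names (`to_tf32`, `split_tf32`, `mul_tf32x3`) are hypothetical and exist only for illustration. Each F32 operand is split into a "big" TF32 part and a "small" TF32 residual, and three TF32 products recover most of the precision lost by a single TF32 multiply.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (double) to the nearest IEEE-754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_tf32(x: float) -> float:
    """Model TF32 by clearing the low 13 of the 23 binary32 mantissa bits.

    Assumption: truncation. Actual hardware conversion may round differently.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign, 8 exponent bits, top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def split_tf32(x: float):
    """Split an F32 value into big + small, both representable in TF32."""
    big = to_tf32(x)
    small = to_tf32(to_f32(x - big))  # residual that the first truncation dropped
    return big, small


def mul_tf32x1(a: float, b: float) -> float:
    """Single TF32 multiply: both operands lose their low mantissa bits."""
    return to_tf32(a) * to_tf32(b)


def mul_tf32x3(a: float, b: float) -> float:
    """Three TF32 multiplies; the small*small term is dropped as negligible."""
    a_big, a_small = split_tf32(a)
    b_big, b_small = split_tf32(b)
    # Add the small cross terms first, then the dominant big*big term.
    return a_small * b_big + a_big * b_small + a_big * b_big


a = to_f32(1.2345678)
b = to_f32(7.654321)
exact = a * b  # product of two binary32 values is exact in double precision
err_x1 = abs(mul_tf32x1(a, b) - exact)
err_x3 = abs(mul_tf32x3(a, b) - exact)
print(f"TF32x1 error: {err_x1:.3e}, TF32x3 error: {err_x3:.3e}")
```

In a real kernel each of the three products would be one `mma.sync` on TF32 fragments accumulating into F32, so the emulation costs roughly three Tensor Core operations per F32 multiply-accumulate.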

Fix performance of mma sync

**Summary**

- Scheduling shared memory loads (`ldsm`) and math (`mma.sync`) operations shows performance gains. We are now at 70us for the GEMM we are measuring (3456x1024x2048xf16).
- We reduced it...
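For context, the 70us figure can be converted into achieved throughput. The sketch below assumes the usual `2 * M * N * K` FLOP count for a GEMM and that 70us covers the whole kernel; the function name `gemm_tflops` is hypothetical. The mapping of 3456x1024x2048 onto M, N, K does not matter for the count.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOP/s for an m x n x k GEMM, counting one multiply and one add per MAC."""
    flops = 2 * m * n * k
    return flops / seconds / 1e12


# Problem size and runtime quoted in the comment above.
tflops = gemm_tflops(3456, 1024, 2048, 70e-6)
print(f"{tflops:.1f} TFLOP/s")  # prints "207.1 TFLOP/s"
```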

Fix performance of mma sync

(1) Support for GEMM pipelining without epilogue peeling (unpeeled epilogue) is done and merged in PR [#10388](https://github.com/iree-org/iree/pull/10388).

- Unpeeled epilogue is shorter and tighter, but...
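The difference between a peeled and an unpeeled epilogue can be illustrated with a toy two-stage software pipeline over a dot product. This is a scalar sketch of the general technique, not IREE's implementation; the function names are hypothetical, and "load" here just means reading the next pair of elements. In the peeled form the last compute step is copied out after the loop. In the unpeeled form the loop runs one more iteration and the load is predicated off once it would run past the end.

```python
def dot_peeled(a, b):
    """Two-stage pipeline with a peeled epilogue."""
    n = len(a)
    if n == 0:
        return 0
    acc = 0
    staged = (a[0], b[0])                  # prologue: load stage 0
    for k in range(n - 1):
        nxt = (a[k + 1], b[k + 1])         # load for iteration k + 1
        acc += staged[0] * staged[1]       # compute for iteration k
        staged = nxt
    acc += staged[0] * staged[1]           # peeled epilogue: final compute
    return acc


def dot_unpeeled(a, b):
    """Two-stage pipeline without epilogue peeling (predicated load)."""
    n = len(a)
    acc = 0
    staged = (a[0], b[0]) if n > 0 else None   # prologue: load stage 0
    for k in range(n):
        # Predicated load: skipped on the last iteration instead of peeling it.
        nxt = (a[k + 1], b[k + 1]) if k + 1 < n else None
        acc += staged[0] * staged[1]       # compute for iteration k
        staged = nxt
    return acc


xs = [1, 2, 3, 4, 5]
ys = [6, 7, 8, 9, 10]
print(dot_peeled(xs, ys), dot_unpeeled(xs, ys))
```

Both variants compute the same result; the unpeeled one avoids duplicating the compute body after the loop, at the cost of a predicate check on every load.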

Fix lowering of mma.sync to nvvm for TF32

For a separate issue [#9394](https://github.com/iree-org/iree/issues/9394) and [LLVM patch](https://reviews.llvm.org/D130294), we have added an optional attribute `tf32Enabled` to make the IR aware that the MmaSyncOp input data type is TF32 and...

Fix performance of mma sync

Progress on bullet (2): Changes to handle native sizes for `nvgpu.mma.sync` and `nvgpu.ldmatrix` are ready to start merging into llvm/llvm-project and iree-org/iree. (i) [llvm/llvm-project](https://github.com/manishucsd/llvm-project/compare/41b3beb7614ae7da833bf8330103ddd0a86e528b...manishucsd:llvm-project:llvm_unrolling_mma_ops#diff-b440b410c31f183ebdcce48e43d7fe089095fddb531232ef23b6508046479b9dR196), and - iree-org/iree-llvm-fork integrate process is running behind,...

Fix performance of mma sync

We have pushed the changes to improve Ampere Tensor Core `mma.sync` performance for F16 and F32. We are now tracking performance issues and further improvements in smaller PRs. I think...