Thomas Raoux
### Request description Currently, transpose operations are vectorized (as long as they are aligned), and we usually end up with code that looks like: ``` r0 = load r1...
### Request description Currently, unaligned elementwise operations go through a slow path. To generalize, we would want them to go through vectorization, and this is a...
Enable the shared memory swizzle transformation as well as pick a good unrolling order for tensorcore
Cutlass added support for float32 emulation using TF32 tensorcore operations. In MLIR we have representations of mma.sync for TF32. We should differentiate mma.sync for float32 and TF32 and have a...
This enables vectorization for some convolutions in order to improve performance. It will still generate very suboptimal code but gives a better baseline.
Promoting the C matrix allows better memory access patterns for the store to global memory. It also simplifies handling fusion with ops when tensorcore is used since we go through shared...
### What happened? When lowering mma.sync TF32, the inputs are currently f32 as we don't have TF32 as a native type. For now we should just cast from float32 to...
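As background for the cast mentioned above, here is a minimal sketch of how an f32-to-TF32 cast can be emulated at the bit level. TF32 keeps f32's 8 exponent bits but only the top 10 of its 23 mantissa bits, so one common recipe is to round away the low 13 mantissa bits with round-to-nearest-even. The function name and the exact rounding recipe are illustrative assumptions, not the actual MLIR lowering:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch (not the actual lowering): emulate the f32 -> TF32
 * cast by rounding away the low 13 mantissa bits (round-to-nearest-even).
 * The result is still stored in a float; only its precision is reduced.
 * Note: Inf/NaN are not handled here (the rounding add could carry into
 * the exponent field). */
static float f32_to_tf32(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    /* Round to nearest, ties to even, over the 13 dropped bits. */
    bits += 0xFFFu + ((bits >> 13) & 1u);
    bits &= ~0x1FFFu; /* clear the dropped mantissa bits */
    float out;
    memcpy(&out, &bits, sizeof out);
    return out;
}
```

For example, 1.0f + 2^-10 is representable in TF32 and survives the cast, while 1.0f + 2^-12 falls below TF32 precision and rounds back to 1.0f.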
This adds two examples from ResNet: convolution and convolution with padding. This will allow us to start developing transform dialect based codegen for those cases.
The two are identical; this will enforce a clean separation and reduce the amount of code to maintain.
Also improves the coverage of TMA loads/stores by testing multiple block sizes that will use different swizzling formats.