Thomas Raoux
### Request description Currently, transpose operations are vectorized (as long as they are aligned), and we usually end up with code that looks like: ``` r0 = load r1...
### Request description Currently, unaligned elementwise operations go through a slow path. To generalize, we would want them to go through vectorization, and this is a...
Enable the shared memory swizzle transformation as well as pick a good unrolling order for tensorcore
Cutlass added support for float32 emulation using TF32 tensorcore operations. In MLIR we have representations of mma.sync for TF32. We should differentiate mma.sync for float32 and TF32 and have a...
This enables vectorization for some convolutions in order to improve performance. It will still generate very suboptimal code but gives a better baseline.
Promoting the C matrix allows better memory access patterns for the store to global memory. It also simplifies handling fusion with ops when tensorcore is used since we go through shared...
### What happened? When lowering mma.sync TF32, the inputs are currently f32 as we don't have TF32 as a native type. For now we should just cast from float32 to...
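As background for the cast mentioned above, here is a minimal sketch of how an f32-to-TF32 cast can be emulated at the bit level. TF32 keeps f32's 8 exponent bits but only the top 10 of its 23 mantissa bits, so one common recipe is to round away the low 13 mantissa bits with round-to-nearest-even. The function name and the exact rounding recipe are illustrative assumptions, not the actual MLIR lowering:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch (not the actual lowering): emulate the f32 -> TF32
 * cast by rounding away the low 13 mantissa bits (round-to-nearest-even).
 * The result is still stored in a float; only its precision is reduced.
 * Note: Inf/NaN are not handled here (the rounding add could carry into
 * the exponent field). */
static float f32_to_tf32(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    /* Round to nearest, ties to even, over the 13 dropped bits. */
    bits += 0xFFFu + ((bits >> 13) & 1u);
    bits &= ~0x1FFFu; /* clear the dropped mantissa bits */
    float out;
    memcpy(&out, &bits, sizeof out);
    return out;
}
```

For example, 1.0f + 2^-10 is representable in TF32 and survives the cast, while 1.0f + 2^-12 falls below TF32 precision and rounds back to 1.0f.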
This adds two examples from ResNet: convolution and convolution with padding. This will allow us to start developing transform dialect based codegen for those cases.
The two are identical; this will enforce a clean separation and reduce the amount of code to maintain.
Also improves the coverage of TMA loads/stores by testing multiple block sizes that will use different swizzling formats.