Yuanqiang Liu
There is about a 20% performance difference between the cutlass profiler's GemmUniversal kernel and my Gemm kernel (they look like the same kernel). **GPU: T4, persistent mode: ON, locked at 1590 MHz** NVCC: 11.1...
Fold `mhlo.transpose` with non-splat constant.
Fuse dilated conv2d with fp16. BTW, I have two questions: 1. Why should the batch dimension not be dynamic? 2. Why is the padding mode set to `SAME` when...
The code was just copied from `tensorflow/compiler/mlir/lite/stablehlo/transforms/unfuse_batch_norm_pass.cc`
… to log an ERROR message when calling SetPriority on a host stream
# Checklist - [x] The title and commit message(s) are descriptive. - [ ] Small commits made to fix your PR have been squashed to avoid history pollution. - [...
* to distinguish which `mhlo.slice` ops could be a non-strided subview
* as title