Kunwar Grover
The prefetch pass assumes that shared memory can be reused in the prologue. This may not be true when nested loops are involved, so we need to explicitly insert a...
This flag https://github.com/iree-org/iree/blob/d834aa7357179e0d806f3634d2efe3af2fa45171/compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp#L90 enables software prefetching for kernels that use shared memory. Software prefetching is disabled by default and is enabled only by this flag. Over time, prefetching became part of GPU...
This flag https://github.com/iree-org/iree/blob/d834aa7357179e0d806f3634d2efe3af2fa45171/compiler/plugins/target/ROCM/ROCMTarget.cpp#L93 sets a waves-per-eu attribute for LLVM compilation on **every dispatch** to give the register allocator a hint. https://github.com/iree-org/iree/pull/17365 introduced a way to specify these LLVM func attributes...
This flag https://github.com/iree-org/iree/blob/d834aa7357179e0d806f3634d2efe3af2fa45171/compiler/src/iree/compiler/Codegen/Common/PolynomialApproximationPass.cpp#L17 disables polynomial approximation for most math dialect operations, for hardware that supports these math operations directly. It looks like some backends rely on this flag for performance...
Depends on https://github.com/iree-org/iree/pull/18780 and https://github.com/iree-org/iree/pull/18771
Post-softmax, the output lies in the range [0, 1]. For low-precision types (like fp8), we scale the output range to [0, fpMax], so we can use more of...
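As a rough illustration of why the scaling helps, here is a minimal Python sketch. It is not the compiler's implementation: `round_to_fp8` is a hypothetical, simplified model of an E4M3-style format, `FP8_MAX = 448.0` assumes the E4M3FN variant (other fp8 types have a different fpMax), and dividing by the scale afterwards is an assumption about how the scale is compensated.

```python
import math

# Assumption: E4M3FN-style fp8 (3 mantissa bits, largest finite value 448).
FP8_MAX = 448.0
MIN_NORMAL_EXP = -6
MANTISSA_BITS = 3


def round_to_fp8(x: float) -> float:
    """Round a non-negative float to the nearest value of a simplified fp8 grid."""
    if x <= 0.0:
        return 0.0
    x = min(x, FP8_MAX)
    # Below the smallest normal exponent the grid spacing stops shrinking
    # (subnormal range), so tiny values lose precision or flush to zero.
    exp = max(math.floor(math.log2(x)), MIN_NORMAL_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)
    return min(round(x / step) * step, FP8_MAX)


def quantize_unscaled(p: float) -> float:
    """Cast a softmax probability in [0, 1] to fp8 directly."""
    return round_to_fp8(p)


def quantize_scaled(p: float) -> float:
    """Scale [0, 1] to [0, FP8_MAX] before the cast, then undo the scale."""
    return round_to_fp8(p * FP8_MAX) / FP8_MAX


small_probability = 0.0005
unscaled_error = abs(quantize_unscaled(small_probability) - small_probability)
scaled_error = abs(quantize_scaled(small_probability) - small_probability)
```

In this toy model the small probability is flushed to zero by the direct cast, while the scaled cast keeps it representable, because the scaled value lands in the part of the fp8 range where the grid is finer relative to the value.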
transfer_gather is distributed just like transfer_read on non-gathered dimensions and like vector.gather on gathered dimensions.
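The split between the two kinds of dimensions can be pictured with a toy Python model. This is only a sketch under assumptions, not the actual distribution pattern: the 2-D layout, the function names, and the thread grid are all hypothetical. Rows are the gathered dimension (each thread takes its slice of the index vector, as with vector.gather) and columns are the non-gathered dimension (each thread adds its own offset to the base index and reads a contiguous run, as with transfer_read).

```python
def gather_rows(source, row_indices, col_start, num_cols):
    """Undistributed reference: result row r is a contiguous slice of source[row_indices[r]]."""
    return [source[i][col_start:col_start + num_cols] for i in row_indices]


def distributed_gather_rows(source, row_indices, col_start, num_cols,
                            row_threads, col_threads):
    """Same result, computed by a hypothetical row_threads x col_threads thread grid."""
    rows_per_thread = len(row_indices) // row_threads
    cols_per_thread = num_cols // col_threads
    result = [[None] * num_cols for _ in row_indices]
    for row_tid in range(row_threads):
        # Gathered dimension: the thread owns a slice of the index vector.
        first_row = row_tid * rows_per_thread
        my_indices = row_indices[first_row:first_row + rows_per_thread]
        for col_tid in range(col_threads):
            # Non-gathered dimension: the thread offsets the base index.
            first_col = col_tid * cols_per_thread
            for r, src_row in enumerate(my_indices):
                for c in range(cols_per_thread):
                    result[first_row + r][first_col + c] = (
                        source[src_row][col_start + first_col + c])
    return result


source = [[10 * r + c for c in range(8)] for r in range(6)]
row_indices = [5, 0, 3, 3]
reference = gather_rows(source, row_indices, 2, 4)
distributed = distributed_gather_rows(source, row_indices, 2, 4, 2, 2)
```

Assembling the per-thread pieces reproduces the undistributed result, which is the property the distribution has to preserve.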