xla

GEMV rewrite pass independent of Triton

Open wenscarl opened this issue 1 year ago • 0 comments

In the decoding stage of inference for some MoE models, the sequence length is 1, and XLA squeezes dimensions of size 1. For example, it transforms a shape of [1, 4096] into [4096], which turns the GEMM into a GEMV operation. When Triton GEMM is disabled, the GEMV rewriter has no effect, so FP8 GEMM operations fail to be rewritten.
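To make the shape change concrete, here is a minimal shape-only sketch in plain Python. It is not XLA code: the helper names (`squeeze_unit_dims`, `classify_dot`, `pad_to_gemm`) and the `[4096, 4096]` weight shape are hypothetical and chosen for illustration. It assumes a GEMV rewrite restores a rank-2 dot by adding a trivial dimension of size 1 to the rank-1 operand.

```python
def squeeze_unit_dims(shape):
    """Drop every dimension of size 1, as described for the decoding case."""
    return tuple(d for d in shape if d != 1)


def classify_dot(lhs_shape, rhs_shape):
    """Label a dot by operand ranks: rank 2 x rank 2 is a GEMM,
    rank 1 against rank 2 (either order) is a GEMV."""
    ranks = (len(lhs_shape), len(rhs_shape))
    if ranks == (2, 2):
        return "gemm"
    if ranks in ((1, 2), (2, 1)):
        return "gemv"
    return "other"


def pad_to_gemm(lhs_shape, rhs_shape):
    """Hypothetical stand-in for a GEMV rewrite: give the rank-1 operand
    a trivial dimension so both operands are rank 2 again."""
    lhs = (1,) + tuple(lhs_shape) if len(lhs_shape) == 1 else tuple(lhs_shape)
    rhs = tuple(rhs_shape) + (1,) if len(rhs_shape) == 1 else tuple(rhs_shape)
    return lhs, rhs


# Activation shape from the issue; the weight shape is an assumption.
activation = (1, 4096)
weight = (4096, 4096)

before = classify_dot(activation, weight)   # both operands rank 2
squeezed = squeeze_unit_dims(activation)    # (4096,)
after = classify_dot(squeezed, weight)      # one operand is now rank 1
restored = pad_to_gemm(squeezed, weight)    # rank-2 operands again
```

In this sketch the dot is a GEMM before the squeeze and a GEMV after it, and padding the vector brings back rank-2 operands. A GEMM-only rewrite, such as the FP8 one, can only match the dot if that padding step has run.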

wenscarl avatar Aug 08 '24 16:08 wenscarl