Make the GEMV rewrite pass independent of Triton
In the decoding stage of inference for some MoE models, the sequence length is 1, and XLA squeezes the resulting size-1 dimensions. For example, it transforms a shape of [1, 4096] into [4096], which turns the matrix multiplication into a GEMV operation. When Triton GEMM is disabled, the GEMV rewriter has no effect, so these GEMVs are left in place and the FP8 GEMM rewrite fails for them.
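The shape problem described above can be sketched at the level of operand shapes. This is only an illustration in plain Python, not XLA code: the helper names (`dot_output_shape`, `is_gemv`, `rewrite_gemv_to_gemm`) and the weight shape `(4096, 14336)` are hypothetical, and the sketch assumes the rewrite works by padding the rank-1 operand with a unit dimension so that the dot becomes a rank-2 by rank-2 GEMM again.

```python
from typing import Tuple

Shape = Tuple[int, ...]


def dot_output_shape(lhs: Shape, rhs: Shape) -> Shape:
    """Output shape of a dot that contracts lhs's last dim with rhs's first dim."""
    if lhs[-1] != rhs[0]:
        raise ValueError(f"contracting dims differ: {lhs[-1]} vs {rhs[0]}")
    return lhs[:-1] + rhs[1:]


def is_gemv(lhs: Shape, rhs: Shape) -> bool:
    """True when exactly one operand is a vector and the other is a matrix."""
    return sorted((len(lhs), len(rhs))) == [1, 2]


def rewrite_gemv_to_gemm(lhs: Shape, rhs: Shape) -> Tuple[Shape, Shape, Shape]:
    """Pad the rank-1 operand with a unit dimension (hypothetical helper).

    Returns (new_lhs, new_rhs, gemm_output_shape). Non-GEMV dots are
    returned unchanged.
    """
    if not is_gemv(lhs, rhs):
        return lhs, rhs, dot_output_shape(lhs, rhs)
    if len(lhs) == 1:
        # Vector-matrix: [K] x [K, N] becomes [1, K] x [K, N].
        new_lhs, new_rhs = (1,) + lhs, rhs
    else:
        # Matrix-vector: [M, K] x [K] becomes [M, K] x [K, 1].
        new_lhs, new_rhs = lhs, rhs + (1,)
    return new_lhs, new_rhs, dot_output_shape(new_lhs, new_rhs)


# Decode step: the activation [1, 4096] has already been squeezed to [4096].
activation: Shape = (4096,)
weight: Shape = (4096, 14336)  # illustrative weight shape, not from the source

gemv_out = dot_output_shape(activation, weight)
new_lhs, new_rhs, gemm_out = rewrite_gemv_to_gemm(activation, weight)
print(is_gemv(activation, weight), gemv_out, new_lhs, new_rhs, gemm_out)
```

In this sketch the GEMM form is restored purely from the operand ranks, with no dependence on which GEMM backend is selected, which is the property the title asks for.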