AMDMIGraphX
[MLIR][Attention] Implement gemm(i8)-dequantizelinear-softmax(fp16)-gemm(fp16) lowering
Problem Description
This ticket is to implement the gemm(i8)-dequantizelinear-softmax(fp16)-gemm(fp16) pattern, lowering a partially-i8 attention kernel in rocMLIR.
Here is one of the example tests we currently have working: https://github.com/ROCm/rocMLIR/blob/develop/mlir/test/fusion/pr-e2e/attention/mixr-attention-first-gemm-i8-f16.mlir
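For reference, a minimal NumPy sketch of the math the fused pattern computes (this is not the rocMLIR implementation; shapes, the dequantization scale, and a zero-point of 0 are all illustrative assumptions):

```python
import numpy as np

# Assumed illustrative shapes: seq x dk query/key, seq x dv value.
rng = np.random.default_rng(0)
seq, dk, dv = 4, 8, 8

q = rng.integers(-128, 127, size=(seq, dk), dtype=np.int8)
k = rng.integers(-128, 127, size=(seq, dk), dtype=np.int8)
v = rng.standard_normal((seq, dv)).astype(np.float16)
scale = np.float16(0.01)  # dequantizelinear scale (assumed; zero-point 0)

# First GEMM in i8, accumulating in i32.
scores_i32 = q.astype(np.int32) @ k.astype(np.int32).T

# dequantizelinear: i32 -> fp16.
scores_f16 = (scores_i32.astype(np.float32) * scale).astype(np.float16)

# Numerically-stable row-wise softmax, result in fp16.
m = scores_f16.max(axis=1, keepdims=True)
e = np.exp((scores_f16 - m).astype(np.float32)).astype(np.float16)
probs = e / e.sum(axis=1, keepdims=True)

# Second GEMM in fp16.
out = probs @ v
print(out.shape, out.dtype)
```

The first GEMM stays in integer arithmetic and only the accumulated scores are dequantized, so the softmax and second GEMM run in fp16, matching the mixed-precision pattern named in the title.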
Operating System
Any
CPU
Any
GPU
AMD Instinct MI300X, AMD Instinct MI250X, AMD Instinct MI250, AMD Instinct MI210
Other
No response
ROCm Version
ROCm 6.0.0
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response