CSE and LICM don't work as expected with exp in the loop
I noticed that "CSE and LICM don't work as expected with `exp` in the loop" is mentioned in /python/triton/ops/flash_attention.py (credits to Adam P. Goucher @apgoucher). Can someone explain the reason for this comment? Has the problem been solved? Thank you so much.
https://github.com/openai/triton/blob/e2bdc8973feb41fc60d31472bdbe3b80c3ad8405/python/triton/ops/flash_attention.py#L59-L63
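For context, the workaround that comment documents is to scale by log2(e) once, outside the loop, and use 2^x instead of e^x inside the loop. Below is a minimal sketch of the pattern (hypothetical kernel and names, not the actual file contents), assuming `tl.math.exp2` as in the real kernel:

```python
import triton
import triton.language as tl

@triton.jit
def softmax_denom_sketch(x_ptr, out_ptr, N, BLOCK: tl.constexpr):
    # Online-softmax denominator for one row, reduced to the essentials.
    LOG2E = 1.44269504   # log2(e); flash_attention.py folds this into sm_scale once
    m = -float("inf")    # running max
    d = 0.0              # running denominator
    for start in range(0, N, BLOCK):
        offs = start + tl.arange(0, BLOCK)
        x = tl.load(x_ptr + offs, mask=offs < N, other=-float("inf"))
        m_new = tl.maximum(m, tl.max(x, 0))
        # 2^x instead of e^x inside the loop, via exp(t) == exp2(t * log2(e)),
        # so no `exp` call is left in the loop body for CSE/LICM to mishandle.
        d = d * tl.math.exp2((m - m_new) * LOG2E) \
            + tl.sum(tl.math.exp2((x - m_new) * LOG2E), 0)
        m = m_new
    tl.store(out_ptr, d)
```

In the linked file the same trick appears as `qk_scale = sm_scale * 1.44269504` hoisted ahead of the inner loop, with `tl.math.exp2` used in place of `tl.exp`.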
This may be an issue with upstream MLIR; I will investigate first.
I printed out the MLIR and found that the `exp` operation is constructed in this form:

```mlir
%146 = tt.extern_elementwise %145 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32, #triton_gpu.slice<{dim = 1, parent = #mma}>>) -> tensor<128xf32, #triton_gpu.slice<{dim = 1, parent = #mma}>> loc(#loc36)
```

Why not use the `math` dialect here?
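For comparison, the equivalent operation in the upstream `math` dialect would presumably be a single op like the following (a hypothetical rendering, assuming the same tensor type and encoding). Since `math.exp` carries no side effects, MLIR's generic CSE and LICM passes can in principle de-duplicate and hoist it:

```mlir
%146 = math.exp %145 : tensor<128xf32, #triton_gpu.slice<{dim = 1, parent = #mma}>>
```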
I also found that `exp` is converted to `exp2` in the `convert-triton-gpu-to-llvm` pass. I don't know much about this context, but if we built `math.exp` directly and then converted it to `exp2` afterwards, that would not prevent MLIR's compiler optimizations.
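For what it's worth, the `exp` → `exp2` conversion is just the identity exp(x) = exp2(x · log2(e)), so it changes the form of the computation but not the result (up to rounding). A quick NumPy check, purely illustrative:

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 101, dtype=np.float32)
LOG2E = np.float32(1.4426950408889634)  # log2(e)

# exp(x) == exp2(x * log2(e)) up to float32 rounding; this is the same
# rewrite applied when `exp` is lowered to `exp2`.
assert np.allclose(np.exp(x), np.exp2(x * LOG2E), rtol=1e-5)
print("exp/exp2 identity holds")
```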