[AMD] Generalize optimize epilogue pattern
This PR adds support for the following pattern in the optimize epilogue pass: ConvertLayout -> ElementwiseOp -> StoreOp
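For illustration, here is a minimal Triton kernel sketch (hypothetical, not taken from this PR or its tests) whose epilogue produces that chain: the `tl.dot` accumulator lives in the mma layout, goes through an elementwise op, and is then stored, so the TTGIR contains ConvertLayout -> ElementwiseOp -> StoreOp. Kernel name, block sizes, and the assumption that M, N, K are divisible by the block sizes are all illustrative.

```python
import triton
import triton.language as tl

# Hypothetical example: the dot accumulator (mma layout) feeds an elementwise
# op (a ReLU via tl.maximum) before tl.store, which is the
# ConvertLayout -> ElementwiseOp -> StoreOp chain the pass now matches.
@triton.jit
def matmul_relu_kernel(a_ptr, b_ptr, c_ptr,
                       M, N, K,
                       stride_am, stride_ak,
                       stride_bk, stride_bn,
                       stride_cm, stride_cn,
                       BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                       BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        # no masking: assumes M, N, K are multiples of the block sizes
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    # elementwise op between the accumulator (mma layout) and the store
    acc = tl.maximum(acc, 0.0)
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)
```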
LGTM, but could you add a lit test to make sure that we don't regress this optimization in the future?
@ptillet
> LGTM, but could you add a lit test to make sure that we don't regress this optimization in the future?
Will do. I have a question: Do you think Nvidia can benefit from this generalization?
> I have a question: Do you think Nvidia can benefit from this generalization?
In general, on other HW the mma layout is not a very efficient layout to store from directly, so doing the convert is better for perf.
> In general, on other HW the mma layout is not a very efficient layout to store from directly, so doing the convert is better for perf.

I am actually not too sure; H100 can definitely store the mma layout efficiently w/ TMAs (whenever we support that), and even on A100 I've seen that if this pattern happens in an inner loop (original flash attention bwd pass for dq), then not converting is actually faster (probably due to lack of sync).
> > In general, on other HW the mma layout is not a very efficient layout to store from directly, so doing the convert is better for perf.
>
> I am actually not too sure; H100 can definitely store the mma layout efficiently w/ TMAs (whenever we support that),

TMA still copies from shared memory, not from registers. I assume we could still convert to shared first.
> and even on A100 I've seen that if this pattern happens in an inner loop (original flash attention bwd pass for dq), then not converting is actually faster (probably due to lack of sync)

Right, I have also seen it help in some cases, but I have seen it be inefficient in a bunch of cases. We could potentially propagate some layout to StoreOp directly in RemoveConvertLayout, but I'm not sure we will be able to find a good criterion for when it's beneficial.
Looks like the build is broken.
@ThomasRaoux Sorry. I'll update this PR and fix the error when #3081 is merged.
Just noticed that https://github.com/openai/triton/pull/3099 removed this pass in the Nvidia backend. I'll remove USE_ROCM in this file, since there is now nothing to compare the pass from this PR against.
@ptillet @ThomasRaoux This PR is ready for review, PTAL
+cc @zhanglx13