[AMD] Generalize optimize epilogue pattern
This PR adds support for the following pattern in the optimize epilogue pass: ConvertLayout -> ElementwiseOp -> StoreOp
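For illustration, here is a minimal Triton kernel sketch (hypothetical, not taken from this PR or its tests) whose epilogue produces that chain: the `tl.dot` accumulator lives in the mma layout, goes through an elementwise op, and is then stored, so the TTGIR contains ConvertLayout -> ElementwiseOp -> StoreOp. Kernel name, block sizes, and the assumption that M, N, K are divisible by the block sizes are all illustrative.

```python
import triton
import triton.language as tl

# Hypothetical example: the dot accumulator (mma layout) feeds an elementwise
# op (a ReLU via tl.maximum) before tl.store, which is the
# ConvertLayout -> ElementwiseOp -> StoreOp chain the pass now matches.
@triton.jit
def matmul_relu_kernel(a_ptr, b_ptr, c_ptr,
                       M, N, K,
                       stride_am, stride_ak,
                       stride_bk, stride_bn,
                       stride_cm, stride_cn,
                       BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                       BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        # no masking: assumes M, N, K are multiples of the block sizes
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    # elementwise op between the accumulator (mma layout) and the store
    acc = tl.maximum(acc, 0.0)
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)
```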
LGTM, but could you add a lit test to make sure that we don't regress this optimization in the future?
@ptillet
> LGTM, but could you add a lit test to make sure that we don't regress this optimization in the future?
Will do. I have a question: Do you think Nvidia can benefit from this generalization?
> I have a question: Do you think Nvidia can benefit from this generalization?
In general, on other HW the mma layout is not a very efficient layout to store from directly, so doing the convert is better for perf.
> In general, on other HW the mma layout is not a very efficient layout to store from directly, so doing the convert is better for perf.

I am actually not too sure; H100 can definitely store the mma layout efficiently w/ TMAs (whenever we support that), and even on A100 I've seen that if this pattern happens in an inner loop (original flash attention bwd pass for dq), then not converting is actually faster (probably due to lack of sync).
> > In general, on other HW the mma layout is not a very efficient layout to store from directly, so doing the convert is better for perf.
>
> I am actually not too sure; H100 can definitely store the mma layout efficiently w/ TMAs (whenever we support that),

TMA still copies from shared memory, not from registers. I assume we could still convert to shared first.
> and even on A100 I've seen that if this pattern happens in an inner loop (original flash attention bwd pass for dq), then not converting is actually faster (probably due to lack of sync)

Right, I have also seen it help in some cases, but I have seen it be inefficient in a bunch of cases. We could potentially propagate some layout to StoreOp directly in RemoveConvertLayout, but I'm not sure we will be able to find a good criterion for when it's beneficial.
Looks like the build is broken.
@ThomasRaoux Sorry. I'll update this PR and fix the error when #3081 is merged.
Just noticed that https://github.com/openai/triton/pull/3099 removed this pass in the Nvidia backend. I'll remove USE_ROCM in this file, since there is now nothing to compare the pass from this PR against.
@ptillet @ThomasRaoux This PR is ready for review, PTAL
+cc @zhanglx13