triton icon indicating copy to clipboard operation
triton copied to clipboard

[AMD] Generalize optimize epilogue pattern

Open binarman opened this issue 1 year ago • 7 comments

This PR adds support of following pattern in optimize epilogue pass: ->ConvertLayout -> ElementwiseOp -> StoreOp

binarman avatar Feb 01 '24 19:02 binarman

LGTM, but could you add a lit test to make sure that we don't regress this optimization in the future?

ptillet avatar Feb 02 '24 08:02 ptillet

@ptillet

LGTM, but could you add a lit test to make sure that we don't regress this optimization in the future?

Will do. I have a question: Do you think Nvidia can benefit from this generalization?

binarman avatar Feb 02 '24 16:02 binarman

I have a question: Do you think Nvidia can benefit from this generalization?

in general on other HW the mma layout is not a very efficient layout to store directly from so doing the convert is better for perf

ThomasRaoux avatar Feb 02 '24 17:02 ThomasRaoux

in general on other HW the mma layout is not a very efficient layout to store directly from so doing the convert is better for perf

I am actually not too sure; H100 can definitely store mma layout efficiently w/ TMAs (whenever we support that), and even on A100 i've seen that if this patterns happens in an inner loop (original flash attention bwd pass for dq) then not converting is actually faster (probably due to lack of sync)

ptillet avatar Feb 02 '24 17:02 ptillet

in general on other HW the mma layout is not a very efficient layout to store directly from so doing the convert is better for perf

I am actually not too sure; H100 can definitely store mma layout efficiently w/ TMAs (whenever we support that),

TMA still copies from shared memory not from register. I assume we could still convert to shared first.

and even on A100 i've seen that if this patterns happens in an inner loop (original flash attention bwd pass for dq) then not converting is actually faster (probably due to lack of sync)

Right, I have also seen it in some cases but I have seen it being inefficient in a bunch of cases. We could potentially propagate some layout to StoreOp directly in RemoveConvertLayout but I'm not sure we will be able to find a good criteria for when it's beneficial.

ThomasRaoux avatar Feb 02 '24 17:02 ThomasRaoux

looks like build is broken

ThomasRaoux avatar Feb 06 '24 19:02 ThomasRaoux

@ThomasRaoux Sorry. I'll update this PR and fix error when #3081 is merged

binarman avatar Feb 07 '24 14:02 binarman

Just noticed that https://github.com/openai/triton/pull/3099 removed this pass in Nvidia backend. I'll remove USE_ROCM in this file, since now there are nothing to compare pass from this PR to

binarman avatar Feb 29 '24 16:02 binarman

@ptillet @ThomasRaoux This PR is ready for review, PTAL

binarman avatar Mar 01 '24 13:03 binarman

+cc @zhanglx13

binarman avatar Mar 01 '24 16:03 binarman