Halide [x86] Codegen phaddw, phaddd, and pmaddwd

This PR adds support for int16 -> int32 horizontal widening adds to use pmaddwd, and pattern matches on horizontal adds to use phadd(w | d), which is faster than the permute + padd that we currently generate for such reductions (cite @abadams + llvm-mca because my x86 machine is currently broken).

Also fly-by move of some code that I incorrectly placed in the wrong runtime file in #6677 .

Jul 21 '22 20:07 rootjalex

Yes, it should be fine - the phadd intrinsics have been in LLVM for over a decade sse3 avx2.

Jul 21 '22 20:07 rootjalex

The vector and scalar versions of op_phaddw_285 disagree. Maximum error: 60123

Jul 21 '22 21:07 steven-johnson

Failures were due to a shuffling mistake, I forgot that AVX2 has the 128 bit boundaries for instructions - fixed in 0acae70, and confirmed that we still produce better codegen for reductions.

Jul 22 '22 15:07 rootjalex

There's a trade-off between adding new llvm ir implementations of things, and rewriting Halide Exprs into forms that would match existing patterns. E.g. for the widening add, you could rewrite the Expr so that it matches should_use_dot_product.

My question is: Did you consider that, and does this PR fall on the right side of that line?

Jul 22 '22 18:07 abadams

We don't have any existing patterns for phadd instructions, so I'll only respond to your question in regards to the use of pmaddwd here. I did not consider rewriting within Halide IR, but your suggestion raises a good point - I realize that I did not provide LLVM implementations for the AVX512 dot_product instructions, which could be similarly used. I suspect that your suggest would benefit from greater generality, and would be better suited for extensibility, so I will change this PR to transform all of the horizontal_widening_adds into the pattern that will match against the existing dot_product patterns

Jul 22 '22 18:07 rootjalex

I will say that I don't think these particular patterns are suitable for rewriting to enable should_use_dot_product, but instead I will aim for producing the patterns that will match the VectorReduce dot_product instructions.

Jul 22 '22 18:07 rootjalex

buildbots are green -- is this ready to land?

Jul 26 '22 00:07 steven-johnson

I might close this PR in favor of #6884 , which will make the horizontal_widening_add patterns useful on any architecture that has pmaddwd variants. Just waiting to get feedback on whether that PR is indeed a better way of doing things.

Jul 26 '22 03:07 rootjalex

Monday Morning Review Ping -- where does this PR stand? Is it still likely to be dropped in favor of #6884?

Aug 22 '22 16:08 steven-johnson

Yes, I am closing this in favor of #6884 . If we end up not merging #6884 for some reason, I will re-open this PR.

Aug 22 '22 17:08 rootjalex

Halide Halide copied to clipboard

[x86] Codegen phaddw, phaddd, and pmaddwd

Halide
Halide copied to clipboard