jdk 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction

Floating-point addition is non-associative, that is adding floating-point elements in arbitrary order may get different value. Specially, Vector API does not define the order of reduction intentionally, which allows platforms to generate more efficient codes [1]. So that needs a node to represent non strictly-ordered add-reduction for floating-point type in C2.

To avoid introducing new nodes, this patch adds a bool field in AddReductionVF/D to distinguish whether they require strict order. It also removes UnorderedReductionNode and adds a virtual function bool requires_strict_order() in ReductionNode. Besides AddReductionVF/D, other reduction nodes' requires_strict_order() have a fixed value.

With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D' on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones.

[AArch64] On Neon, non strictly-ordered AddReductionVF/D cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2].

This patch adds matching rules for non strictly-ordered AddReductionVF/D.

No effects on other platforms.

[Performance] FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. With this patch, it improves ~3x on my SVE machine (128-bit).

ADDLanes

Benchmark                 Before     After      Unit
FloatMaxVector.ADDLanes   1789.513   5264.226   ops/ms

Final code is as below:

Before: fadda z17.s, p7/m, z17.s, z16.s After:

        faddp        v17.4s, v21.4s, v21.4s
        faddp        s18, v17.2s
        fadd         s18, s18, s19

[Test] Full jtreg passed on AArch64 and x86.

[1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 [2] https://bugs.openjdk.org/browse/JDK-8275275 [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316

Progress

[x] Change must not contain extraneous whitespace
[x] Commit message must refer to an issue
[x] Change must be properly reviewed (2 reviews required, with at least 2 Reviewers)

Integration blocker

⚠️ Title mismatch between PR and JBS for issue JDK-8320725

Issue

JDK-8320725: AArch64: C2: Add "requires_strict_order" flag for floating-point add and mul reduction (Enhancement - P4) ⚠️ Title mismatch between PR and JBS.

Reviewers

Guoxiong Li (@lgxbslgx - Reviewer) ⚠️ Review applies to f8f79ac2
Andrew Haley (@theRealAph - Reviewer) ⚠️ Review applies to bdd0fabf

Contributors

Eric Liu <[email protected]>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18034/head:pull/18034
$ git checkout pull/18034

Update a local copy of the PR:
$ git checkout pull/18034
$ git pull https://git.openjdk.org/jdk.git pull/18034/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 18034

View PR using the GUI difftool:
$ git pr show -t 18034

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18034.diff

Webrev

Link to Webrev Comment

Feb 27 '24 21:02 Bhavana-Kilambi

:wave: Welcome back bkilambi! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

Feb 27 '24 21:02 bridgekeeper[bot]

@Bhavana-Kilambi The following label will be automatically applied to this pull request:

hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

Feb 27 '24 21:02 openjdk[bot]

/contributor add @e1iu

Feb 27 '24 21:02 Bhavana-Kilambi

@Bhavana-Kilambi Contributor Eric Liu <[email protected]> successfully added.

Feb 27 '24 21:02 openjdk[bot]

Webrevs

Feb 27 '24 21:02 mlbridge[bot]

Before this patch, the ReductionNode is strict-ordered and the UnorderedReductionNode is unordered. If we want an unordered reduction node, we just extends UnorderedReductionNode instead of ReductionNode. Now you think every concrete node object should has the right to judge whether it is strict-ordered or not by itself (using the method requires_strict_order). If my understanding is right, this patch is a nice refactor. Some concrete comments are shown below.

Hi, thanks for your review. You are correct. With this patch, all reduction type nodes (whether ordered or unordered) can decide if they are ordered/unordered based on 'requires_strict_order' method. In addition to that, they can also define a field - 'requires_strict_order' like in the floating-point add reduction node definitions in case there are multiple paths where a specific node can be either ordered or unordered (instead of returning a fixed true/false value). I will make changes you suggested and update patch accordingly. Thanks again!

Feb 29 '24 11:02 Bhavana-Kilambi

Can I please ask for more reviews for this PR? Thank you in advance!

Mar 12 '24 14:03 Bhavana-Kilambi

Looks interesting! Will have a look at it. /reviewers 2 reviewer

Mar 12 '24 16:03 eme64

@eme64 The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 2 Reviewers).

Mar 12 '24 16:03 openjdk[bot]

A suggestion about naming: We now have a few synonyms:

Unordered reduction
non-strict order reduction
associative reduction

I think I introduced the "unordered" one. Not proud of it any more. I think we should probably use (non) associative everywhere. That is the technical/mathematical term. We can use synonyms in the comments to make the explanation more clear though.

Mar 12 '24 17:03 eme64

On a more visionary note:

We should make sure that the actual ReductionNode gets moved out of the loop, when possible. JDK-8309647 [Vector API] Move Reduction outside loop when possible We have an RFE for that, I have not yet have time or priority for it. Of course the user can already move it out of the loop themselves.

If the ReductionNode is out of the loop, then you usually just have a very cheap accumulation inside the loop, a MulVF for example. That would certainly be cheap enough to allow vectorization.

So in that case, your optimization here should not just affect SVE, but also NEON and x86.

Why does you patch not do anything for x86? I guess x86 AD-files have no float/double reduction for the associative case, only the non-associative (strict order). But I think it would be easy to implement, just take the code used for int/long etc reductions.

What do you think about that? I'm not saying you have to do it all, or even in this RFE. I'd just like to hear what is the bigger plan, and why you restrict things to much to SVE.

Mar 12 '24 17:03 eme64

@eme64 Thank you so much for your review and feedback comments. Here are my responses to your questions -

> So in that case, your optimization here should not just affect SVE, but also NEON and x86. From what I understand, even when the reduction nodes are hoisted out of the loop, it would still generate AddReductionVF/VD nodes (fewer as we accumulate inside the loop now) and based on the choice of order the corresponding backend match rules (as included in this patch) should generate the expected instruction sequence. I don't think we would need any code changes for Neon/SVE after hoisting the reductions out of the loop. Please let me know if my understanding is incorrect.

> Why does you patch not do anything for x86? I guess x86 AD-files have no float/double reduction for the associative case, only the non-associative (strict order). But I think it would be easy to implement, just take the code used for int/long etc reductions. Well, what I meant was that the changes in this patch (specifically the mid-end part) do not break/change anything in x86 (or any other platform). Yes, the current *.ad files might have rules only for the strict order case and more rules can be added for non-associative case if that benefits x86 (same for other platforms). So for aarch64, we have different instruction(s) for floating-point strict order/non-strict order and we know which ones are beneficial to be generated on which aarch64 machines. I could have tried but I am not very well versed with x86 ISA and would request anyone from Intel or someone who has the expertise with x86 ISA to make x86 related changes please (if required). Maybe Jatin/Sandhya can help here? I am unable to tag them for some reason.

> What do you think about that? I'm not saying you have to do it all, or even in this RFE. I'd just like to hear what is the bigger plan, and why you restrict things to much to SVE. To give a background : We observed a significant performance degradation with SVE instructions compared to Neon for this testcase - FloatMaxVector.ADDLanes on a 128-bit SVE machine. It generates the SVE "fadda" instruction which is a strictly-ordered floating-point add reduction instruction. As it has a higher cost compared to the Neon implementation for FP add reduction, the performance with "fadda" was ~66% worse compared to Neon. As VectorAPI does not impose any rules on FP ordering, it could have generated the faster non-strict Neon instructions instead (on a 128-bit SVE machine). The reason we included a flag "requires_strict_order" to mark a reduction node as strictly-ordered or non-strictly ordered and generate the corresponding backend instructions. On aarch64, this patch only affects the 128-bit SVE machines. On SVE machines >128bits, the "fadda" instruction is generated as it was before this patch. There's no change on Neon as well - the non-strict Neon instructions are generated with VectorAPI and no auto-vectorization is allowed for FP reduction nodes. Although this change was done keeping SVE in mind, this patch can help generate strictly ordered or non-strictly ordered code on other platforms as well (if they have different implementations for both) and also simplifies the IdealGraph a bit by removing the UnorderedReductionNode. I will make changes regarding the naming in the patch in the next commit. Thanks.

Mar 13 '24 17:03 Bhavana-Kilambi

@Bhavana-Kilambi This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8320725: AArch64: C2: Add "requires_strict_order" flag for floating-point add and mul reduction

Co-authored-by: Eric Liu <[email protected]>
Reviewed-by: gli, epeter, aph

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 16 new commits pushed to the master branch:

83b34410e326c47f357a37c3a337b7dedb8cbbda: 8322811: jcmd System.dump_map help info has conflicting statements
8aa35cacfcc94d261de102b628eb954c71eae98e: 8333833: Remove the use of ByteArrayLittleEndian from UUID::toString
de55db2352f84c101f8197ee7aca80d72807fbc5: 8333522: JFR SwapSpace event might read wrong free swap space size
a941397327972f130e683167a1b429f17603df46: 8329031: CPUID feature detection for Advanced Performance Extensions (Intel® APX)
8d2f9e57c3797c01c84df007f4d2bfdcd645d0c0: 8333749: Consolidate ConstantDesc conversion in java.base
a6fc2f839a5e494b940ee473cbd942ec5f884324: 8333412: [s390x] Add support for branch on count instruction
cf677c901e70d98404ec9cc3d75a93926e02fcd2: 8333823: Update --release 23 symbol information for JDK 23 build 26
18e7d7b5e710b24e49b995777906a197e35795e6: 8333716: Shenandoah: Check for disarmed method before taking the nmethod lock
c37d02aef38da178fcf56e3c5cccc41cc5175421: 8312412: Uninitialized klassVtable::_verify_count field
17bd483ff01e463cef45824f0c1296a8f3e782c8: 8333680: com/sun/tools/attach/BasicTests.java fails with "SocketException: Permission denied: connect"
... and 6 more: https://git.openjdk.org/jdk/compare/486dee2cf420981b4c8111c24c5fbd27aceb238b...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@lgxbslgx, @eme64, @theRealAph) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

Mar 13 '24 20:03 openjdk[bot]

There seem to be a couple of failures. Linux cross-compile fails but this seems to be more of a build problem than something caused by my patch. There's one JTREG test failure on x86 - tools/javac/patterns/Exhaustiveness.java which seemed to fail due to - Agent error: java.lang.Exception: Agent 8 timed out with a timeout of 480 seconds; I re-ran this specific test manually on an x86 machine and this test passed.

Mar 15 '24 19:03 Bhavana-Kilambi

> Why does you patch not do anything for x86? I guess x86 AD-files have no float/double reduction for the associative case, only the non-associative (strict order). But I think it would be easy to implement, just take the code used for int/long etc reductions. Well, what I meant was that the changes in this patch (specifically the mid-end part) do not break/change anything in x86 (or any other platform). Yes, the current *.ad files might have rules only for the strict order case and more rules can be added for non-associative case if that benefits x86 (same for other platforms). So for aarch64, we have different instruction(s) for floating-point strict order/non-strict order and we know which ones are beneficial to be generated on which aarch64 machines. I could have tried but I am not very well versed with x86 ISA and would request anyone from Intel or someone who has the expertise with x86 ISA to make x86 related changes please (if required). Maybe Jatin/Sandhya can help here? I am unable to tag them for some reason.

I agree, as per following Vector API documentation, backends are free to deviate from JVM specification (15.18.2) which enforces non-associativity on FP operation. https://docs.oracle.com/en/java/javase/21/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#fp_assoc:~:text=Certain%20associative%20operations,this%20machine%20code. You may create a follow-up RFE for x86 side of optimization and assign it to me.

Mar 18 '24 08:03 jatin-bhateja

Thank you Jatin. I will do that once this PR is approved and merged upstream.

Mar 18 '24 09:03 Bhavana-Kilambi

Please rename this to "8320725: AArch64: C2: Add "is_associative" flag for floating-point add-reduction"

Mar 18 '24 17:03 theRealAph

You probably want to change the name of the PR again: Add "is_associative" flag for floating-point add-reduction -> 8320725: AArch64: C2: Add "requires_strict_order" flag for floating-point add-reduction

Apr 05 '24 10:04 eme64

@eme64 Thank you for the notification and your review ! I did change the title in the bug but forgot to do that in the PR as well. Changed that now. I will make the changes you suggested in the next PS.

Apr 05 '24 11:04 Bhavana-Kilambi

I see that this test - "jdk/java/util/HashMap/WhiteBoxResizeTest.java" seems to have failed on an x86 machine. I have tested it on my local x86 machine and the test passes.

Apr 22 '24 15:04 Bhavana-Kilambi

Hi @eme64 @theRealAph I have uploaded the latest patch addressing all review comments. Can I please ask for more reviews. Thank you ..

May 01 '24 08:05 Bhavana-Kilambi

Sorry for the delay, I'm really excited about this one, just had to get some more critical things done recently ;)

May 07 '24 13:05 eme64

Sorry for the delay, I'm really excited about this one, just had to get some more critical things done recently ;)

Thanks for the review. I will update my patch soon with your suggestions. Apologies for not making changes in the test* directory regarding the UnorderedReduction node which is now deleted but some tests seem to exist.

May 07 '24 15:05 Bhavana-Kilambi

/reviewers 2 reviewer I'll look at it again, once my concerns are all addressed. @Bhavana-Kilambi feel free to ping me again for that.

May 08 '24 12:05 eme64

@eme64 The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 2 Reviewers).

May 08 '24 12:05 openjdk[bot]

I just realized that there is no regression test. And I think it would be nice to have one.

Also, we should add some sort of message to the dump if the ReductionNode has the requires_strict_order on or off. I think that could be done in dump_spec.

You could do it similar to:
#ifndef PRODUCT
void VectorMaskCmpNode::dump_spec(outputStream *st) const {
  st->print(" %d #", _predicate); _type->dump_on(st);
}
#endif // PRODUCT
This would actually allow you to create a IR test!

You would check that the AddReductionVNode is annotated correctly. You need some VectorAPI tests, and some SuperWord auto-vectorization tests.

How does that sound? That would ensure that nobody can easily destroy your RFE, at least not in the IR.

Hi @eme64 , thanks for the suggestion. I can add the dump_spec as suggested (which would print if the _requires_strict_order flag is enabled/disabled) but I am not sure if I fully understand what's expected in the JTREG tests. Should I be verifying the -XX:+PrintIdeal output to make sure the correct message is being printed for the ReductionV* nodes?

May 08 '24 20:05 Bhavana-Kilambi

I am not sure if I fully understand what's expected in the JTREG tests. Should I be verifying the -XX:+PrintIdeal output to make sure the correct message is being printed for the ReductionV* nodes?

Yes, the IR framework basically does regex matching against the PrintIdeal graph. For example: counts = {IRNode.STORE_VECTOR, ">0"} in the @IR rule executes the regex for the store vector, and checks if we find more than zero occurances.

Maybe you can just use a regex string directly for your special IR rule. Alternatively, you could have them in the IRNode class, but not sure that's worth it.

May 13 '24 05:05 eme64

@eme64 Thanks for the clarification. I understand the usage of counts in the IR tests. Just that I got a bit confused by some of your earlier statements. We do actually have a test to make sure AddReductionVF/VD and MulReductionVF/VD are not generated on aarch64 NEON machines - test/hotspot/jtreg/compiler/c2/irTests/TestDisableAutoVectOpcodes.java. I can modify this test to include UseSVE > 0 case as well and will also add a separate JTREG test for the VectorAPI tests. Hope that's ok..

May 13 '24 10:05 Bhavana-Kilambi

@Bhavana-Kilambi I know we have the tests in test/hotspot/jtreg/compiler/c2/irTests/TestDisableAutoVectOpcodes.java, and some other reduction tests. But these do not do the specific think I would like to see.

I would like this:

Add no_strict_order vs requires_strict_order or similar to dump_spec.
IR match not just that there is the correct ReductionNode, but also that it has the no_strict_order or requires_strict_order in its dump. You can do that by using a custom regex string, rather than IRNode.STORE_VECTOR or similar.
Then, create different tests, some where we expect ordered, some unordered vectors. Use Vector API and SuperWord examples.

Does that make sense?

May 13 '24 11:05 eme64

@Bhavana-Kilambi I know we have the tests in test/hotspot/jtreg/compiler/c2/irTests/TestDisableAutoVectOpcodes.java, and some other reduction tests. But these do not do the specific think I would like to see.

I would like this:
* Add `no_strict_order` vs `requires_strict_order` or similar to `dump_spec`.

* IR match not just that there is the correct `ReductionNode`, but also that it has the `no_strict_order` or `requires_strict_order` in its dump. You can do that by using a custom regex string, rather than `IRNode.STORE_VECTOR` or similar.

* Then, create different tests, some where we expect ordered, some unordered vectors. Use Vector API and SuperWord examples.
Does that make sense?

Yes, I am doing exactly that. Just that for the superword(auto-vec) case, I am just modifying the AddReduction related tests in TestDisableAutoVectOpcodes.java to incorporate the case with UseSVE > 0 as well and match the regex as per the dump_spec output.

May 13 '24 12:05 Bhavana-Kilambi

jdk jdk copied to clipboard

8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction

Progress

Integration blocker

Issue

Reviewers

Contributors

Reviewing

Webrev

Webrevs

jdk
jdk copied to clipboard