jdk icon indicating copy to clipboard operation
jdk copied to clipboard

8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction

Open Bhavana-Kilambi opened this issue 1 year ago • 31 comments

Floating-point addition is non-associative, that is adding floating-point elements in arbitrary order may get different value. Specially, Vector API does not define the order of reduction intentionally, which allows platforms to generate more efficient codes [1]. So that needs a node to represent non strictly-ordered add-reduction for floating-point type in C2.

To avoid introducing new nodes, this patch adds a bool field in AddReductionVF/D to distinguish whether they require strict order. It also removes UnorderedReductionNode and adds a virtual function bool requires_strict_order() in ReductionNode. Besides AddReductionVF/D, other reduction nodes' requires_strict_order() have a fixed value.

With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D' on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones.

[AArch64] On Neon, non strictly-ordered AddReductionVF/D cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2].

This patch adds matching rules for non strictly-ordered AddReductionVF/D.

No effects on other platforms.

[Performance] FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. With this patch, it improves ~3x on my SVE machine (128-bit).

ADDLanes

Benchmark                 Before     After      Unit
FloatMaxVector.ADDLanes   1789.513   5264.226   ops/ms

Final code is as below:

Before: fadda z17.s, p7/m, z17.s, z16.s After:

        faddp        v17.4s, v21.4s, v21.4s
        faddp        s18, v17.2s
        fadd         s18, s18, s19


[Test] Full jtreg passed on AArch64 and x86.

[1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 [2] https://bugs.openjdk.org/browse/JDK-8275275 [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316


Progress

  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue
  • [x] Change must be properly reviewed (2 reviews required, with at least 2 Reviewers)

Integration blocker

 ⚠️ Title mismatch between PR and JBS for issue JDK-8320725

Issue

  • JDK-8320725: AArch64: C2: Add "requires_strict_order" flag for floating-point add and mul reduction (Enhancement - P4) ⚠️ Title mismatch between PR and JBS.

Reviewers

Contributors

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18034/head:pull/18034
$ git checkout pull/18034

Update a local copy of the PR:
$ git checkout pull/18034
$ git pull https://git.openjdk.org/jdk.git pull/18034/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 18034

View PR using the GUI difftool:
$ git pr show -t 18034

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18034.diff

Webrev

Link to Webrev Comment

Bhavana-Kilambi avatar Feb 27 '24 21:02 Bhavana-Kilambi

:wave: Welcome back bkilambi! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Feb 27 '24 21:02 bridgekeeper[bot]

@Bhavana-Kilambi The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Feb 27 '24 21:02 openjdk[bot]

/contributor add @e1iu

Bhavana-Kilambi avatar Feb 27 '24 21:02 Bhavana-Kilambi

@Bhavana-Kilambi Contributor Eric Liu <[email protected]> successfully added.

openjdk[bot] avatar Feb 27 '24 21:02 openjdk[bot]

Before this patch, the ReductionNode is strict-ordered and the UnorderedReductionNode is unordered. If we want an unordered reduction node, we just extends UnorderedReductionNode instead of ReductionNode. Now you think every concrete node object should has the right to judge whether it is strict-ordered or not by itself (using the method requires_strict_order). If my understanding is right, this patch is a nice refactor. Some concrete comments are shown below.

Hi, thanks for your review. You are correct. With this patch, all reduction type nodes (whether ordered or unordered) can decide if they are ordered/unordered based on 'requires_strict_order' method. In addition to that, they can also define a field - 'requires_strict_order' like in the floating-point add reduction node definitions in case there are multiple paths where a specific node can be either ordered or unordered (instead of returning a fixed true/false value). I will make changes you suggested and update patch accordingly. Thanks again!

Bhavana-Kilambi avatar Feb 29 '24 11:02 Bhavana-Kilambi

Can I please ask for more reviews for this PR? Thank you in advance!

Bhavana-Kilambi avatar Mar 12 '24 14:03 Bhavana-Kilambi

Looks interesting! Will have a look at it. /reviewers 2 reviewer

eme64 avatar Mar 12 '24 16:03 eme64

@eme64 The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 2 Reviewers).

openjdk[bot] avatar Mar 12 '24 16:03 openjdk[bot]

A suggestion about naming: We now have a few synonyms:

Unordered reduction
non-strict order reduction
associative reduction

I think I introduced the "unordered" one. Not proud of it any more. I think we should probably use (non) associative everywhere. That is the technical/mathematical term. We can use synonyms in the comments to make the explanation more clear though.

eme64 avatar Mar 12 '24 17:03 eme64

On a more visionary note:

We should make sure that the actual ReductionNode gets moved out of the loop, when possible. JDK-8309647 [Vector API] Move Reduction outside loop when possible We have an RFE for that, I have not yet have time or priority for it. Of course the user can already move it out of the loop themselves.

If the ReductionNode is out of the loop, then you usually just have a very cheap accumulation inside the loop, a MulVF for example. That would certainly be cheap enough to allow vectorization.

So in that case, your optimization here should not just affect SVE, but also NEON and x86.

Why does you patch not do anything for x86? I guess x86 AD-files have no float/double reduction for the associative case, only the non-associative (strict order). But I think it would be easy to implement, just take the code used for int/long etc reductions.

What do you think about that? I'm not saying you have to do it all, or even in this RFE. I'd just like to hear what is the bigger plan, and why you restrict things to much to SVE.

eme64 avatar Mar 12 '24 17:03 eme64

@eme64 Thank you so much for your review and feedback comments. Here are my responses to your questions -

> So in that case, your optimization here should not just affect SVE, but also NEON and x86. From what I understand, even when the reduction nodes are hoisted out of the loop, it would still generate AddReductionVF/VD nodes (fewer as we accumulate inside the loop now) and based on the choice of order the corresponding backend match rules (as included in this patch) should generate the expected instruction sequence. I don't think we would need any code changes for Neon/SVE after hoisting the reductions out of the loop. Please let me know if my understanding is incorrect.

> Why does you patch not do anything for x86? I guess x86 AD-files have no float/double reduction for the associative case, only the non-associative (strict order). But I think it would be easy to implement, just take the code used for int/long etc reductions. Well, what I meant was that the changes in this patch (specifically the mid-end part) do not break/change anything in x86 (or any other platform). Yes, the current *.ad files might have rules only for the strict order case and more rules can be added for non-associative case if that benefits x86 (same for other platforms). So for aarch64, we have different instruction(s) for floating-point strict order/non-strict order and we know which ones are beneficial to be generated on which aarch64 machines. I could have tried but I am not very well versed with x86 ISA and would request anyone from Intel or someone who has the expertise with x86 ISA to make x86 related changes please (if required). Maybe Jatin/Sandhya can help here? I am unable to tag them for some reason.

> What do you think about that? I'm not saying you have to do it all, or even in this RFE. I'd just like to hear what is the bigger plan, and why you restrict things to much to SVE. To give a background : We observed a significant performance degradation with SVE instructions compared to Neon for this testcase - FloatMaxVector.ADDLanes on a 128-bit SVE machine. It generates the SVE "fadda" instruction which is a strictly-ordered floating-point add reduction instruction. As it has a higher cost compared to the Neon implementation for FP add reduction, the performance with "fadda" was ~66% worse compared to Neon. As VectorAPI does not impose any rules on FP ordering, it could have generated the faster non-strict Neon instructions instead (on a 128-bit SVE machine). The reason we included a flag "requires_strict_order" to mark a reduction node as strictly-ordered or non-strictly ordered and generate the corresponding backend instructions. On aarch64, this patch only affects the 128-bit SVE machines. On SVE machines >128bits, the "fadda" instruction is generated as it was before this patch. There's no change on Neon as well - the non-strict Neon instructions are generated with VectorAPI and no auto-vectorization is allowed for FP reduction nodes. Although this change was done keeping SVE in mind, this patch can help generate strictly ordered or non-strictly ordered code on other platforms as well (if they have different implementations for both) and also simplifies the IdealGraph a bit by removing the UnorderedReductionNode. I will make changes regarding the naming in the patch in the next commit. Thanks.

Bhavana-Kilambi avatar Mar 13 '24 17:03 Bhavana-Kilambi

@Bhavana-Kilambi This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8320725: AArch64: C2: Add "requires_strict_order" flag for floating-point add and mul reduction

Co-authored-by: Eric Liu <[email protected]>
Reviewed-by: gli, epeter, aph

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 16 new commits pushed to the master branch:

  • 83b34410e326c47f357a37c3a337b7dedb8cbbda: 8322811: jcmd System.dump_map help info has conflicting statements
  • 8aa35cacfcc94d261de102b628eb954c71eae98e: 8333833: Remove the use of ByteArrayLittleEndian from UUID::toString
  • de55db2352f84c101f8197ee7aca80d72807fbc5: 8333522: JFR SwapSpace event might read wrong free swap space size
  • a941397327972f130e683167a1b429f17603df46: 8329031: CPUID feature detection for Advanced Performance Extensions (Intel® APX)
  • 8d2f9e57c3797c01c84df007f4d2bfdcd645d0c0: 8333749: Consolidate ConstantDesc conversion in java.base
  • a6fc2f839a5e494b940ee473cbd942ec5f884324: 8333412: [s390x] Add support for branch on count instruction
  • cf677c901e70d98404ec9cc3d75a93926e02fcd2: 8333823: Update --release 23 symbol information for JDK 23 build 26
  • 18e7d7b5e710b24e49b995777906a197e35795e6: 8333716: Shenandoah: Check for disarmed method before taking the nmethod lock
  • c37d02aef38da178fcf56e3c5cccc41cc5175421: 8312412: Uninitialized klassVtable::_verify_count field
  • 17bd483ff01e463cef45824f0c1296a8f3e782c8: 8333680: com/sun/tools/attach/BasicTests.java fails with "SocketException: Permission denied: connect"
  • ... and 6 more: https://git.openjdk.org/jdk/compare/486dee2cf420981b4c8111c24c5fbd27aceb238b...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@lgxbslgx, @eme64, @theRealAph) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

openjdk[bot] avatar Mar 13 '24 20:03 openjdk[bot]

There seem to be a couple of failures. Linux cross-compile fails but this seems to be more of a build problem than something caused by my patch. There's one JTREG test failure on x86 - tools/javac/patterns/Exhaustiveness.java which seemed to fail due to - Agent error: java.lang.Exception: Agent 8 timed out with a timeout of 480 seconds; I re-ran this specific test manually on an x86 machine and this test passed.

Bhavana-Kilambi avatar Mar 15 '24 19:03 Bhavana-Kilambi

> Why does you patch not do anything for x86? I guess x86 AD-files have no float/double reduction for the associative case, only the non-associative (strict order). But I think it would be easy to implement, just take the code used for int/long etc reductions. Well, what I meant was that the changes in this patch (specifically the mid-end part) do not break/change anything in x86 (or any other platform). Yes, the current *.ad files might have rules only for the strict order case and more rules can be added for non-associative case if that benefits x86 (same for other platforms). So for aarch64, we have different instruction(s) for floating-point strict order/non-strict order and we know which ones are beneficial to be generated on which aarch64 machines. I could have tried but I am not very well versed with x86 ISA and would request anyone from Intel or someone who has the expertise with x86 ISA to make x86 related changes please (if required). Maybe Jatin/Sandhya can help here? I am unable to tag them for some reason.

I agree, as per following Vector API documentation, backends are free to deviate from JVM specification (15.18.2) which enforces non-associativity on FP operation. https://docs.oracle.com/en/java/javase/21/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#fp_assoc:~:text=Certain%20associative%20operations,this%20machine%20code. You may create a follow-up RFE for x86 side of optimization and assign it to me.

jatin-bhateja avatar Mar 18 '24 08:03 jatin-bhateja

Thank you Jatin. I will do that once this PR is approved and merged upstream.

Bhavana-Kilambi avatar Mar 18 '24 09:03 Bhavana-Kilambi

Please rename this to "8320725: AArch64: C2: Add "is_associative" flag for floating-point add-reduction"

theRealAph avatar Mar 18 '24 17:03 theRealAph

You probably want to change the name of the PR again: Add "is_associative" flag for floating-point add-reduction -> 8320725: AArch64: C2: Add "requires_strict_order" flag for floating-point add-reduction

eme64 avatar Apr 05 '24 10:04 eme64

@eme64 Thank you for the notification and your review ! I did change the title in the bug but forgot to do that in the PR as well. Changed that now. I will make the changes you suggested in the next PS.

Bhavana-Kilambi avatar Apr 05 '24 11:04 Bhavana-Kilambi

I see that this test - "jdk/java/util/HashMap/WhiteBoxResizeTest.java" seems to have failed on an x86 machine. I have tested it on my local x86 machine and the test passes.

Bhavana-Kilambi avatar Apr 22 '24 15:04 Bhavana-Kilambi

Hi @eme64 @theRealAph I have uploaded the latest patch addressing all review comments. Can I please ask for more reviews. Thank you ..

Bhavana-Kilambi avatar May 01 '24 08:05 Bhavana-Kilambi

Sorry for the delay, I'm really excited about this one, just had to get some more critical things done recently ;)

eme64 avatar May 07 '24 13:05 eme64

Sorry for the delay, I'm really excited about this one, just had to get some more critical things done recently ;)

Thanks for the review. I will update my patch soon with your suggestions. Apologies for not making changes in the test* directory regarding the UnorderedReduction node which is now deleted but some tests seem to exist.

Bhavana-Kilambi avatar May 07 '24 15:05 Bhavana-Kilambi

/reviewers 2 reviewer I'll look at it again, once my concerns are all addressed. @Bhavana-Kilambi feel free to ping me again for that.

eme64 avatar May 08 '24 12:05 eme64

@eme64 The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 2 Reviewers).

openjdk[bot] avatar May 08 '24 12:05 openjdk[bot]

I just realized that there is no regression test. And I think it would be nice to have one.

Also, we should add some sort of message to the dump if the ReductionNode has the requires_strict_order on or off. I think that could be done in dump_spec.

You could do it similar to:

#ifndef PRODUCT
void VectorMaskCmpNode::dump_spec(outputStream *st) const {
  st->print(" %d #", _predicate); _type->dump_on(st);
}
#endif // PRODUCT

This would actually allow you to create a IR test!

You would check that the AddReductionVNode is annotated correctly. You need some VectorAPI tests, and some SuperWord auto-vectorization tests.

How does that sound? That would ensure that nobody can easily destroy your RFE, at least not in the IR.

Hi @eme64 , thanks for the suggestion. I can add the dump_spec as suggested (which would print if the _requires_strict_order flag is enabled/disabled) but I am not sure if I fully understand what's expected in the JTREG tests. Should I be verifying the -XX:+PrintIdeal output to make sure the correct message is being printed for the ReductionV* nodes?

Bhavana-Kilambi avatar May 08 '24 20:05 Bhavana-Kilambi

I am not sure if I fully understand what's expected in the JTREG tests. Should I be verifying the -XX:+PrintIdeal output to make sure the correct message is being printed for the ReductionV* nodes?

Yes, the IR framework basically does regex matching against the PrintIdeal graph. For example: counts = {IRNode.STORE_VECTOR, ">0"} in the @IR rule executes the regex for the store vector, and checks if we find more than zero occurances.

Maybe you can just use a regex string directly for your special IR rule. Alternatively, you could have them in the IRNode class, but not sure that's worth it.

eme64 avatar May 13 '24 05:05 eme64

@eme64 Thanks for the clarification. I understand the usage of counts in the IR tests. Just that I got a bit confused by some of your earlier statements. We do actually have a test to make sure AddReductionVF/VD and MulReductionVF/VD are not generated on aarch64 NEON machines - test/hotspot/jtreg/compiler/c2/irTests/TestDisableAutoVectOpcodes.java. I can modify this test to include UseSVE > 0 case as well and will also add a separate JTREG test for the VectorAPI tests. Hope that's ok..

Bhavana-Kilambi avatar May 13 '24 10:05 Bhavana-Kilambi

@Bhavana-Kilambi I know we have the tests in test/hotspot/jtreg/compiler/c2/irTests/TestDisableAutoVectOpcodes.java, and some other reduction tests. But these do not do the specific think I would like to see.

I would like this:

  • Add no_strict_order vs requires_strict_order or similar to dump_spec.
  • IR match not just that there is the correct ReductionNode, but also that it has the no_strict_order or requires_strict_order in its dump. You can do that by using a custom regex string, rather than IRNode.STORE_VECTOR or similar.
  • Then, create different tests, some where we expect ordered, some unordered vectors. Use Vector API and SuperWord examples.

Does that make sense?

eme64 avatar May 13 '24 11:05 eme64

@Bhavana-Kilambi I know we have the tests in test/hotspot/jtreg/compiler/c2/irTests/TestDisableAutoVectOpcodes.java, and some other reduction tests. But these do not do the specific think I would like to see.

I would like this:

* Add `no_strict_order` vs `requires_strict_order` or similar to `dump_spec`.

* IR match not just that there is the correct `ReductionNode`, but also that it has the `no_strict_order` or `requires_strict_order` in its dump. You can do that by using a custom regex string, rather than `IRNode.STORE_VECTOR` or similar.

* Then, create different tests, some where we expect ordered, some unordered vectors. Use Vector API and SuperWord examples.

Does that make sense?

Yes, I am doing exactly that. Just that for the superword(auto-vec) case, I am just modifying the AddReduction related tests in TestDisableAutoVectOpcodes.java to incorporate the case with UseSVE > 0 as well and match the regex as per the dump_spec output.

Bhavana-Kilambi avatar May 13 '24 12:05 Bhavana-Kilambi