jdk
jdk copied to clipboard
8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction
Floating-point addition is non-associative, that is adding floating-point elements in arbitrary order may get different value. Specially, Vector API does not define the order of reduction intentionally, which allows platforms to generate more efficient codes [1]. So that needs a node to represent non strictly-ordered add-reduction for floating-point type in C2.
To avoid introducing new nodes, this patch adds a bool field in AddReductionVF/D to distinguish whether they require strict order. It also removes UnorderedReductionNode and adds a virtual function bool requires_strict_order() in ReductionNode. Besides AddReductionVF/D, other reduction nodes' requires_strict_order() have a fixed value.
With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D' on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones.
[AArch64]
On Neon, non strictly-ordered AddReductionVF/D cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2].
This patch adds matching rules for non strictly-ordered AddReductionVF/D.
No effects on other platforms.
[Performance] FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. With this patch, it improves ~3x on my SVE machine (128-bit).
ADDLanes
Benchmark Before After Unit
FloatMaxVector.ADDLanes 1789.513 5264.226 ops/ms
Final code is as below:
Before:
fadda z17.s, p7/m, z17.s, z16.s
After:
faddp v17.4s, v21.4s, v21.4s
faddp s18, v17.2s
fadd s18, s18, s19
[Test] Full jtreg passed on AArch64 and x86.
[1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529 [2] https://bugs.openjdk.org/browse/JDK-8275275 [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316
Progress
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
- [x] Change must be properly reviewed (2 reviews required, with at least 2 Reviewers)
Integration blocker
⚠️ Title mismatch between PR and JBS for issue JDK-8320725
Issue
- JDK-8320725: AArch64: C2: Add "requires_strict_order" flag for floating-point add and mul reduction (Enhancement - P4) ⚠️ Title mismatch between PR and JBS.
Reviewers
- Guoxiong Li (@lgxbslgx - Reviewer) ⚠️ Review applies to f8f79ac2
- Andrew Haley (@theRealAph - Reviewer) ⚠️ Review applies to bdd0fabf
Contributors
- Eric Liu
<[email protected]>
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18034/head:pull/18034
$ git checkout pull/18034
Update a local copy of the PR:
$ git checkout pull/18034
$ git pull https://git.openjdk.org/jdk.git pull/18034/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 18034
View PR using the GUI difftool:
$ git pr show -t 18034
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18034.diff
Webrev
:wave: Welcome back bkilambi! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@Bhavana-Kilambi The following label will be automatically applied to this pull request:
hotspot-compiler
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
/contributor add @e1iu
@Bhavana-Kilambi
Contributor Eric Liu <[email protected]> successfully added.
Webrevs
- 11: Full - Incremental (35e6258d)
- 10: Full - Incremental (db88e3c9)
- 09: Full - Incremental (b8f6cfb5)
- 08: Full - Incremental (3afde82c)
- 07: Full - Incremental (bdd0fabf)
- 06: Full - Incremental (6d25d78f)
- 05: Full - Incremental (f38dae21)
- 04: Full - Incremental (71a86deb)
- 03: Full - Incremental (1156ef39)
- 02: Full - Incremental (4aed4b50)
- 01: Full - Incremental (f8f79ac2)
- 00: Full (f8492ece)
Before this patch, the
ReductionNodeis strict-ordered and theUnorderedReductionNodeis unordered. If we want an unordered reduction node, we just extendsUnorderedReductionNodeinstead ofReductionNode. Now you think every concrete node object should has the right to judge whether it is strict-ordered or not by itself (using the methodrequires_strict_order). If my understanding is right, this patch is a nice refactor. Some concrete comments are shown below.
Hi, thanks for your review. You are correct. With this patch, all reduction type nodes (whether ordered or unordered) can decide if they are ordered/unordered based on 'requires_strict_order' method. In addition to that, they can also define a field - 'requires_strict_order' like in the floating-point add reduction node definitions in case there are multiple paths where a specific node can be either ordered or unordered (instead of returning a fixed true/false value). I will make changes you suggested and update patch accordingly. Thanks again!
Can I please ask for more reviews for this PR? Thank you in advance!
Looks interesting! Will have a look at it. /reviewers 2 reviewer
@eme64 The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 2 Reviewers).
A suggestion about naming: We now have a few synonyms:
Unordered reduction
non-strict order reduction
associative reduction
I think I introduced the "unordered" one. Not proud of it any more. I think we should probably use (non) associative everywhere. That is the technical/mathematical term. We can use synonyms in the comments to make the explanation more clear though.
On a more visionary note:
We should make sure that the actual ReductionNode gets moved out of the loop, when possible.
JDK-8309647 [Vector API] Move Reduction outside loop when possible
We have an RFE for that, I have not yet have time or priority for it.
Of course the user can already move it out of the loop themselves.
If the ReductionNode is out of the loop, then you usually just have a very cheap accumulation inside the loop, a MulVF for example. That would certainly be cheap enough to allow vectorization.
So in that case, your optimization here should not just affect SVE, but also NEON and x86.
Why does you patch not do anything for x86? I guess x86 AD-files have no float/double reduction for the associative case, only the non-associative (strict order). But I think it would be easy to implement, just take the code used for int/long etc reductions.
What do you think about that? I'm not saying you have to do it all, or even in this RFE. I'd just like to hear what is the bigger plan, and why you restrict things to much to SVE.
@eme64 Thank you so much for your review and feedback comments. Here are my responses to your questions -
> So in that case, your optimization here should not just affect SVE, but also NEON and x86. From what I understand, even when the reduction nodes are hoisted out of the loop, it would still generate AddReductionVF/VD nodes (fewer as we accumulate inside the loop now) and based on the choice of order the corresponding backend match rules (as included in this patch) should generate the expected instruction sequence. I don't think we would need any code changes for Neon/SVE after hoisting the reductions out of the loop. Please let me know if my understanding is incorrect.
> Why does you patch not do anything for x86? I guess x86 AD-files have no float/double reduction for the associative case, only the non-associative (strict order). But I think it would be easy to implement, just take the code used for int/long etc reductions. Well, what I meant was that the changes in this patch (specifically the mid-end part) do not break/change anything in x86 (or any other platform). Yes, the current *.ad files might have rules only for the strict order case and more rules can be added for non-associative case if that benefits x86 (same for other platforms). So for aarch64, we have different instruction(s) for floating-point strict order/non-strict order and we know which ones are beneficial to be generated on which aarch64 machines. I could have tried but I am not very well versed with x86 ISA and would request anyone from Intel or someone who has the expertise with x86 ISA to make x86 related changes please (if required). Maybe Jatin/Sandhya can help here? I am unable to tag them for some reason.
> What do you think about that? I'm not saying you have to do it all, or even in this RFE. I'd just like to hear what is the bigger plan, and why you restrict things to much to SVE. To give a background : We observed a significant performance degradation with SVE instructions compared to Neon for this testcase - FloatMaxVector.ADDLanes on a 128-bit SVE machine. It generates the SVE "fadda" instruction which is a strictly-ordered floating-point add reduction instruction. As it has a higher cost compared to the Neon implementation for FP add reduction, the performance with "fadda" was ~66% worse compared to Neon. As VectorAPI does not impose any rules on FP ordering, it could have generated the faster non-strict Neon instructions instead (on a 128-bit SVE machine). The reason we included a flag "requires_strict_order" to mark a reduction node as strictly-ordered or non-strictly ordered and generate the corresponding backend instructions. On aarch64, this patch only affects the 128-bit SVE machines. On SVE machines >128bits, the "fadda" instruction is generated as it was before this patch. There's no change on Neon as well - the non-strict Neon instructions are generated with VectorAPI and no auto-vectorization is allowed for FP reduction nodes. Although this change was done keeping SVE in mind, this patch can help generate strictly ordered or non-strictly ordered code on other platforms as well (if they have different implementations for both) and also simplifies the IdealGraph a bit by removing the UnorderedReductionNode. I will make changes regarding the naming in the patch in the next commit. Thanks.
@Bhavana-Kilambi This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8320725: AArch64: C2: Add "requires_strict_order" flag for floating-point add and mul reduction
Co-authored-by: Eric Liu <[email protected]>
Reviewed-by: gli, epeter, aph
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 16 new commits pushed to the master branch:
- 83b34410e326c47f357a37c3a337b7dedb8cbbda: 8322811: jcmd System.dump_map help info has conflicting statements
- 8aa35cacfcc94d261de102b628eb954c71eae98e: 8333833: Remove the use of ByteArrayLittleEndian from UUID::toString
- de55db2352f84c101f8197ee7aca80d72807fbc5: 8333522: JFR SwapSpace event might read wrong free swap space size
- a941397327972f130e683167a1b429f17603df46: 8329031: CPUID feature detection for Advanced Performance Extensions (Intel® APX)
- 8d2f9e57c3797c01c84df007f4d2bfdcd645d0c0: 8333749: Consolidate ConstantDesc conversion in java.base
- a6fc2f839a5e494b940ee473cbd942ec5f884324: 8333412: [s390x] Add support for branch on count instruction
- cf677c901e70d98404ec9cc3d75a93926e02fcd2: 8333823: Update --release 23 symbol information for JDK 23 build 26
- 18e7d7b5e710b24e49b995777906a197e35795e6: 8333716: Shenandoah: Check for disarmed method before taking the nmethod lock
- c37d02aef38da178fcf56e3c5cccc41cc5175421: 8312412: Uninitialized klassVtable::_verify_count field
- 17bd483ff01e463cef45824f0c1296a8f3e782c8: 8333680: com/sun/tools/attach/BasicTests.java fails with "SocketException: Permission denied: connect"
- ... and 6 more: https://git.openjdk.org/jdk/compare/486dee2cf420981b4c8111c24c5fbd27aceb238b...master
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@lgxbslgx, @eme64, @theRealAph) but any other Committer may sponsor as well.
➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).
There seem to be a couple of failures. Linux cross-compile fails but this seems to be more of a build problem than something caused by my patch. There's one JTREG test failure on x86 - tools/javac/patterns/Exhaustiveness.java which seemed to fail due to - Agent error: java.lang.Exception: Agent 8 timed out with a timeout of 480 seconds; I re-ran this specific test manually on an x86 machine and this test passed.
> Why does you patch not do anything for x86? I guess x86 AD-files have no float/double reduction for the associative case, only the non-associative (strict order). But I think it would be easy to implement, just take the code used for int/long etc reductions. Well, what I meant was that the changes in this patch (specifically the mid-end part) do not break/change anything in x86 (or any other platform). Yes, the current *.ad files might have rules only for the strict order case and more rules can be added for non-associative case if that benefits x86 (same for other platforms). So for aarch64, we have different instruction(s) for floating-point strict order/non-strict order and we know which ones are beneficial to be generated on which aarch64 machines. I could have tried but I am not very well versed with x86 ISA and would request anyone from Intel or someone who has the expertise with x86 ISA to make x86 related changes please (if required). Maybe Jatin/Sandhya can help here? I am unable to tag them for some reason.
I agree, as per following Vector API documentation, backends are free to deviate from JVM specification (15.18.2) which enforces non-associativity on FP operation. https://docs.oracle.com/en/java/javase/21/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#fp_assoc:~:text=Certain%20associative%20operations,this%20machine%20code. You may create a follow-up RFE for x86 side of optimization and assign it to me.
Thank you Jatin. I will do that once this PR is approved and merged upstream.
Please rename this to "8320725: AArch64: C2: Add "is_associative" flag for floating-point add-reduction"
You probably want to change the name of the PR again:
Add "is_associative" flag for floating-point add-reduction -> 8320725: AArch64: C2: Add "requires_strict_order" flag for floating-point add-reduction
@eme64 Thank you for the notification and your review ! I did change the title in the bug but forgot to do that in the PR as well. Changed that now. I will make the changes you suggested in the next PS.
I see that this test - "jdk/java/util/HashMap/WhiteBoxResizeTest.java" seems to have failed on an x86 machine. I have tested it on my local x86 machine and the test passes.
Hi @eme64 @theRealAph I have uploaded the latest patch addressing all review comments. Can I please ask for more reviews. Thank you ..
Sorry for the delay, I'm really excited about this one, just had to get some more critical things done recently ;)
Sorry for the delay, I'm really excited about this one, just had to get some more critical things done recently ;)
Thanks for the review. I will update my patch soon with your suggestions. Apologies for not making changes in the test* directory regarding the UnorderedReduction node which is now deleted but some tests seem to exist.
/reviewers 2 reviewer I'll look at it again, once my concerns are all addressed. @Bhavana-Kilambi feel free to ping me again for that.
@eme64 The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 2 Reviewers).
I just realized that there is no regression test. And I think it would be nice to have one.
Also, we should add some sort of message to the
dumpif theReductionNodehas therequires_strict_orderon or off. I think that could be done indump_spec.You could do it similar to:
#ifndef PRODUCT void VectorMaskCmpNode::dump_spec(outputStream *st) const { st->print(" %d #", _predicate); _type->dump_on(st); } #endif // PRODUCTThis would actually allow you to create a IR test!
You would check that the AddReductionVNode is annotated correctly. You need some VectorAPI tests, and some SuperWord auto-vectorization tests.
How does that sound? That would ensure that nobody can easily destroy your RFE, at least not in the IR.
Hi @eme64 , thanks for the suggestion. I can add the dump_spec as suggested (which would print if the _requires_strict_order flag is enabled/disabled) but I am not sure if I fully understand what's expected in the JTREG tests. Should I be verifying the -XX:+PrintIdeal output to make sure the correct message is being printed for the ReductionV* nodes?
I am not sure if I fully understand what's expected in the JTREG tests. Should I be verifying the -XX:+PrintIdeal output to make sure the correct message is being printed for the ReductionV* nodes?
Yes, the IR framework basically does regex matching against the PrintIdeal graph. For example: counts = {IRNode.STORE_VECTOR, ">0"} in the @IR rule executes the regex for the store vector, and checks if we find more than zero occurances.
Maybe you can just use a regex string directly for your special IR rule. Alternatively, you could have them in the IRNode class, but not sure that's worth it.
@eme64 Thanks for the clarification. I understand the usage of counts in the IR tests. Just that I got a bit confused by some of your earlier statements. We do actually have a test to make sure AddReductionVF/VD and MulReductionVF/VD are not generated on aarch64 NEON machines - test/hotspot/jtreg/compiler/c2/irTests/TestDisableAutoVectOpcodes.java. I can modify this test to include UseSVE > 0 case as well and will also add a separate JTREG test for the VectorAPI tests. Hope that's ok..
@Bhavana-Kilambi
I know we have the tests in test/hotspot/jtreg/compiler/c2/irTests/TestDisableAutoVectOpcodes.java, and some other reduction tests. But these do not do the specific think I would like to see.
I would like this:
- Add
no_strict_ordervsrequires_strict_orderor similar todump_spec. - IR match not just that there is the correct
ReductionNode, but also that it has theno_strict_orderorrequires_strict_orderin its dump. You can do that by using a custom regex string, rather thanIRNode.STORE_VECTORor similar. - Then, create different tests, some where we expect ordered, some unordered vectors. Use Vector API and SuperWord examples.
Does that make sense?
@Bhavana-Kilambi I know we have the tests in
test/hotspot/jtreg/compiler/c2/irTests/TestDisableAutoVectOpcodes.java, and some other reduction tests. But these do not do the specific think I would like to see.I would like this:
* Add `no_strict_order` vs `requires_strict_order` or similar to `dump_spec`. * IR match not just that there is the correct `ReductionNode`, but also that it has the `no_strict_order` or `requires_strict_order` in its dump. You can do that by using a custom regex string, rather than `IRNode.STORE_VECTOR` or similar. * Then, create different tests, some where we expect ordered, some unordered vectors. Use Vector API and SuperWord examples.Does that make sense?
Yes, I am doing exactly that. Just that for the superword(auto-vec) case, I am just modifying the AddReduction related tests in TestDisableAutoVectOpcodes.java to incorporate the case with UseSVE > 0 as well and match the regex as per the dump_spec output.