jdk
jdk copied to clipboard
8342692: C2: MemorySegment API slow with short running loops
To optimize a long counted loop and long range checks in a long or int counted loop, the loop is turned into a loop nest. When the loop has few iterations, the overhead of having an outer loop whose backedge is never taken, has a measurable cost. Furthermore, creating the loop nest usually causes one iteration of the loop to be peeled so predicates can be set up. If the loop is short running, then it's an extra iteration that's run with range checks (compared to an int counted loop with int range checks).
This change doesn't create a loop nest when:
1- it can be determined statically at loop nest creation time that the loop runs for a short enough number of iterations
2- profiling reports that the loop runs for no more than ShortLoopIter iterations (1000 by default).
For 2-, a guard is added which is implemented as yet another predicate.
While this change is in principle simple, I ran into a few implementation issues:
-
while c2 has a way to compute the number of iterations of an int counted loop, it doesn't have that for long counted loop. The existing logic for int counted loops promotes values to long to avoid overflows. I reworked it so it now works for both long and int counted loops.
-
I added a new deoptimization reason (Reason_short_running_loop) for the new predicate. Given the number of iterations is narrowed down by the predicate, the limit of the loop after transformation is a cast node that's control dependent on the short running loop predicate. Because once the counted loop is transformed, it is likely that range check predicates will be inserted and they will depend on the limit, the short running loop predicate has to be the one that's further away from the loop entry. Now it is also possible that the limit before transformation depends on a predicate (TestShortRunningLongCountedLoopPredicatesClone is an example), we can have: new predicates inserted after the transformation that depend on the casted limit that itself depend on old predicates added before the transformation. To solve this cicular dependency, parse and assert predicates are cloned between the old predicates and the loop head. The cloned short running loop parse predicate is the one that's used to insert the short running loop predicate.
-
In the case of a long counted loop, the loop is transformed into a regular loop with a new limit and transformed range checks that's later turned into an in counted loop. The int counted loop doesn't need loop limit checks because of the way it's constructed. There's an assert that catches that we don't attempt to add one. I ran into test failures where, by the time the int counted loop is created, the fact that the number of iterations of the loop is small enough to not need a loop limit check gets lost. I added a cast to make sure the narrowed limit's type is not lost (I had to do something similar for loop nests). But then, I ran into the same issue again because the cast was pushed through a sub or add and the narrowed type was lost. I propose that pushing casts through sub/add be only done after loop opts are over (same as what's done for range check
CastII).
On Maurizio's benchmark that's mentioned in the bug, this gives a ~30% performance increase. /cc hotspot-compiler
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8342692: C2: MemorySegment API slow with short running loops (Enhancement - P4)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/21630/head:pull/21630
$ git checkout pull/21630
Update a local copy of the PR:
$ git checkout pull/21630
$ git pull https://git.openjdk.org/jdk.git pull/21630/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 21630
View PR using the GUI difftool:
$ git pr show -t 21630
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/21630.diff
Webrev
:wave: Welcome back roland! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@rwestrel This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops
Co-authored-by: Maurizio Cimadamore <[email protected]>
Co-authored-by: Christian Hagedorn <[email protected]>
Reviewed-by: chagedorn, thartmann
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 171 new commits pushed to the master branch:
- 62a58062e5f3d0a723608d98d2412ea779f73897: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist
- 9609f57cef684d2f44d3e12a3522811a3c0776f4: 8361752: Double free in CompileQueue::delete_all after JDK-8357473
- 441dbde2c3c915ffd916e39a5b4a91df5620d7f3: 8362556: New test jdk/jfr/event/io/TestIOTopFrame.java is failing on all platforms
- ... and 168 more: https://git.openjdk.org/jdk/compare/310ef85667bdba3f984cb6327aee71cfaf91458b...master
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.
@rwestrel this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:
git checkout JDK-8342692
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
@rwestrel
The hotspot-compiler label was successfully added.
Webrevs
- 36: Full - Incremental (9cae7ead)
- 35: Full (bb69cc02)
- 34: Full - Incremental (f3ca08de)
- 33: Full (fd19ee84)
- 32: Full - Incremental (5ad3f22e)
- 31: Full - Incremental (b1da1b13)
- 30: Full - Incremental (43e8a14f)
- 29: Full - Incremental (42cd7091)
- 28: Full - Incremental (42f1ecfd)
- 27: Full - Incremental (d409deb4)
- 26: Full - Incremental (66e960aa)
- 25: Full - Incremental (af1d12bf)
- 24: Full - Incremental (e55393ee)
- 23: Full (6100f757)
- 22: Full (a0726ac7)
- 21: Full - Incremental (2164c15f)
- 20: Full - Incremental (b0129598)
- 19: Full - Incremental (223c9481)
- 18: Full - Incremental (1289e211)
- 17: Full - Incremental (9798bc97)
- 16: Full - Incremental (b56a2649)
- 15: Full (ed774a56)
- 14: Full (065abb29)
- 13: Full - Incremental (34571869)
- 12: Full (8877698d)
- 11: Full (f740b966)
- 10: Full (f68a7ca6)
- 09: Full (3df20871)
- 08: Full (7dd6fde9)
- 07: Full (0f137359)
- 06: Full (5e29f364)
- 05: Full - Incremental (aa567717)
- 04: Full (74c38342)
- 03: Full (32d8d630)
- 02: Full - Incremental (2deb4dba)
- 01: Full (d7cde041)
- 00: Full (2fde1723)
I didn't look at it yet but submitted some quick testing. The build on Mac AArch64 fails:
[2024-10-23T11:56:28,256Z] /System/Volumes/Data/mesos/work_dir/slaves/7a20d425-e769-4142-b5c1-e3cc2d88e03e-S37429/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/6cb59bf8-52bf-4698-bc7c-7bac27fa71af/runs/66922795-8c84-4b35-8612-4d25564e6c23/workspace/open/src/hotspot/share/opto/loopTransform.cpp:2069:69: error: format specifies type 'long' but the argument has type 'julong' (aka 'unsigned long long') [-Werror,-Wformat]
[2024-10-23T11:56:28,256Z] tty->print("Unroll %d(%2ld) ", loop_head->unrolled_count()*2, loop_head->trip_count());
[2024-10-23T11:56:28,256Z] ~~~~ ^~~~~~~~~~~~~~~~~~~~~~~
[2024-10-23T11:56:28,256Z] %2llu
[2024-10-23T11:56:28,256Z] /System/Volumes/Data/mesos/work_dir/slaves/7a20d425-e769-4142-b5c1-e3cc2d88e03e-S37429/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/6cb59bf8-52bf-4698-bc7c-7bac27fa71af/runs/66922795-8c84-4b35-8612-4d25564e6c23/workspace/open/src/hotspot/share/opto/loopTransform.cpp:2322:35: error: format specifies type 'long' but the argument has type 'julong' (aka 'unsigned long long') [-Werror,-Wformat]
[2024-10-23T11:56:28,256Z] tty->print("MaxUnroll %ld ", cl->trip_count());
[2024-10-23T11:56:28,256Z] ~~~ ^~~~~~~~~~~~~~~~
[2024-10-23T11:56:28,256Z] %llu
More failures:
compiler/loopopts/TestOverunrolling.java
-XX:-TieredCompilation -XX:+AlwaysIncrementalInline
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/workspace/open/src/hotspot/share/opto/loopnode.cpp:6195), pid=2648018, tid=2648035
# assert(!had_error) failed: bad dominance
#
# JRE version: Java(TM) SE Runtime Environment (24.0) (fastdebug build 24-internal-2024-10-23-1151312.tobias.hartmann.jdk4)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 24-internal-2024-10-23-1151312.tobias.hartmann.jdk4, compiled mode, sharing, compressed oops, compressed class ptrs, parallel gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0x12f07a7] PhaseIdealLoop::compute_lca_of_uses(Node*, Node*, bool)+0x927
Current CompileTask:
C2:19645 2268 b compiler.loopopts.TestOverunrolling::test3 (89 bytes)
Stack: [0x00007f59a4cee000,0x00007f59a4dee000], sp=0x00007f59a4de8ca0, free space=1003k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x12f07a7] PhaseIdealLoop::compute_lca_of_uses(Node*, Node*, bool)+0x927 (loopnode.cpp:6195)
V [libjvm.so+0x12f0b78] PhaseIdealLoop::build_loop_late_post_work(Node*, bool)+0x1d8 (loopnode.cpp:6610)
V [libjvm.so+0x12f1b20] PhaseIdealLoop::build_loop_late(VectorSet&, Node_List&, Node_Stack&)+0x190 (loopnode.cpp:6561)
V [libjvm.so+0x12f2938] PhaseIdealLoop::build_and_optimize()+0x6d8 (loopnode.cpp:4974)
V [libjvm.so+0xa356c5] PhaseIdealLoop::verify(PhaseIterGVN&)+0x3c5 (loopnode.hpp:1144)
V [libjvm.so+0xa303b3] Compile::Optimize()+0x743 (compile.cpp:2397)
V [libjvm.so+0xa34683] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1b23 (compile.cpp:852)
V [libjvm.so+0x87ee45] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1d5 (c2compiler.cpp:142)
V [libjvm.so+0xa40518] CompileBroker::invoke_compiler_on_method(CompileTask*)+0x928 (compileBroker.cpp:2303)
V [libjvm.so+0xa411a8] CompileBroker::compiler_thread_loop()+0x478 (compileBroker.cpp:1961)
V [libjvm.so+0xef10fc] JavaThread::thread_main_inner()+0xcc (javaThread.cpp:759)
V [libjvm.so+0x181dad6] Thread::call_run()+0xb6 (thread.cpp:234)
V [libjvm.so+0x14ff5b8] thread_native_entry(Thread*)+0x128 (os_linux.cpp:858)
compiler/loopopts/superword/TestMemorySegment.java
Failed IR Rules (6) of Methods (6)
----------------------------------
1) Method "static java.lang.Object[] compiler.loopopts.superword.TestMemorySegmentImpl.testIntLoop_longIndex_intInvar_sameAdr_byte(java.lang.foreign.MemorySegment,int)" - [Failed IR rules: 1]:
* @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={"sse4.1", "true", "asimd", "true"}, counts={"_#V#LOAD_VECTOR_B#_", "> 0", "_#V#ADD_VB#_", "> 0", "_#STORE_VECTOR#_", "> 0"}, applyIfPlatformOr={}, applyIfPlatform={"64-bit", "true"}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
> Phase "PrintIdeal":
- counts: Graph contains wrong number of nodes:
* Constraint 1: "(\\d+(\\s){2}(LoadVector.*)+(\\s){2}===.*vector[A-Za-z]<B,32>)"
- Failed comparison: [found] 0 > 0 [given]
- No nodes matched!
* Constraint 2: "(\\d+(\\s){2}(AddVB.*)+(\\s){2}===.*vector[A-Za-z]<B,32>)"
- Failed comparison: [found] 0 > 0 [given]
- No nodes matched!
* Constraint 3: "(\\d+(\\s){2}(StoreVector.*)+(\\s){2}===.*)"
- Failed comparison: [found] 0 > 0 [given]
- No nodes matched!
2) Method "static java.lang.Object[] compiler.loopopts.superword.TestMemorySegmentImpl.testIntLoop_longIndex_intInvar_sameAdr_int(java.lang.foreign.MemorySegment,int)" - [Failed IR rules: 1]:
* @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={"sse4.1", "true", "asimd", "true"}, counts={"_#V#LOAD_VECTOR_I#_", "> 0", "_#V#ADD_VI#_", "> 0", "_#STORE_VECTOR#_", "> 0"}, applyIfPlatformOr={}, applyIfPlatform={"64-bit", "true"}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={"AlignVector", "false"}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
> Phase "PrintIdeal":
- counts: Graph contains wrong number of nodes:
* Constraint 1: "(\\d+(\\s){2}(LoadVector.*)+(\\s){2}===.*vector[A-Za-z]<I,8>)"
- Failed comparison: [found] 0 > 0 [given]
- No nodes matched!
* Constraint 2: "(\\d+(\\s){2}(AddVI.*)+(\\s){2}===.*vector[A-Za-z]<I,8>)"
- Failed comparison: [found] 0 > 0 [given]
- No nodes matched!
* Constraint 3: "(\\d+(\\s){2}(StoreVector.*)+(\\s){2}===.*)"
- Failed comparison: [found] 0 > 0 [given]
- No nodes matched!
3) Method "static java.lang.Object[] compiler.loopopts.superword.TestMemorySegmentImpl.testIntLoop_longIndex_longInvar_sameAdr_byte(java.lang.foreign.MemorySegment,long)" - [Failed IR rules: 1]:
* @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={"sse4.1", "true", "asimd", "true"}, counts={"_#V#LOAD_VECTOR_B#_", "> 0", "_#V#ADD_VB#_", "> 0", "_#STORE_VECTOR#_", "> 0"}, applyIfPlatformOr={}, applyIfPlatform={"64-bit", "true"}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
> Phase "PrintIdeal":
- counts: Graph contains wrong number of nodes:
* Constraint 1: "(\\d+(\\s){2}(LoadVector.*)+(\\s){2}===.*vector[A-Za-z]<B,32>)"
- Failed comparison: [found] 0 > 0 [given]
- No nodes matched!
* Constraint 2: "(\\d+(\\s){2}(AddVB.*)+(\\s){2}===.*vector[A-Za-z]<B,32>)"
- Failed comparison: [found] 0 > 0 [given]
- No nodes matched!
* Constraint 3: "(\\d+(\\s){2}(StoreVector.*)+(\\s){2}===.*)"
- Failed comparison: [found] 0 > 0 [given]
- No nodes matched!
4) Method "static java.lang.Object[] compiler.loopopts.superword.TestMemorySegmentImpl.testIntLoop_longIndex_longInvar_sameAdr_int(java.lang.foreign.MemorySegment,long)" - [Failed IR rules: 1]:
* @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={"sse4.1", "true", "asimd", "true"}, counts={"_#V#LOAD_VECTOR_I#_", "> 0", "_#V#ADD_VI#_", "> 0", "_#STORE_VECTOR#_", "> 0"}, applyIfPlatformOr={}, applyIfPlatform={"64-bit", "true"}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={"AlignVector", "false"}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
> Phase "PrintIdeal":
- counts: Graph contains wrong number of nodes:
* Constraint 1: "(\\d+(\\s){2}(LoadVector.*)+(\\s){2}===.*vector[A-Za-z]<I,8>)"
- Failed comparison: [found] 0 > 0 [given]
- No nodes matched!
* Constraint 2: "(\\d+(\\s){2}(AddVI.*)+(\\s){2}===.*vector[A-Za-z]<I,8>)"
- Failed comparison: [found] 0 > 0 [given]
- No nodes matched!
* Constraint 3: "(\\d+(\\s){2}(StoreVector.*)+(\\s){2}===.*)"
- Failed comparison: [found] 0 > 0 [given]
- No nodes matched!
5) Method "static java.lang.Object[] compiler.loopopts.superword.TestMemorySegmentImpl.testLongLoop_longIndex_intInvar_sameAdr_byte(java.lang.foreign.MemorySegment,int)" - [Failed IR rules: 1]:
* @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={"sse4.1", "true", "asimd", "true"}, counts={"_#V#LOAD_VECTOR_B#_", "> 0", "_#V#ADD_VB#_", "> 0", "_#STORE_VECTOR#_", "> 0"}, applyIfPlatformOr={}, applyIfPlatform={"64-bit", "true"}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
> Phase "PrintIdeal":
- counts: Graph contains wrong number of nodes:
* Constraint 1: "(\\d+(\\s){2}(LoadVector.*)+(\\s){2}===.*vector[A-Za-z]<B,32>)"
- Failed comparison: [found] 0 > 0 [given]
- No nodes matched!
* Constraint 2: "(\\d+(\\s){2}(AddVB.*)+(\\s){2}===.*vector[A-Za-z]<B,32>)"
- Failed comparison: [found] 0 > 0 [given]
- No nodes matched!
* Constraint 3: "(\\d+(\\s){2}(StoreVector.*)+(\\s){2}===.*)"
- Failed comparison: [found] 0 > 0 [given]
- No nodes matched!
6) Method "static java.lang.Object[] compiler.loopopts.superword.TestMemorySegmentImpl.testLongLoop_longIndex_longInvar_sameAdr_byte(java.lang.foreign.MemorySegment,long)" - [Failed IR rules: 1]:
* @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={"sse4.1", "true", "asimd", "true"}, counts={"_#V#LOAD_VECTOR_B#_", "> 0", "_#V#ADD_VB#_", "> 0", "_#STORE_VECTOR#_", "> 0"}, applyIfPlatformOr={}, applyIfPlatform={"64-bit", "true"}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
> Phase "PrintIdeal":
- counts: Graph contains wrong number of nodes:
* Constraint 1: "(\\d+(\\s){2}(LoadVector.*)+(\\s){2}===.*vector[A-Za-z]<B,32>)"
- Failed comparison: [found] 0 > 0 [given]
- No nodes matched!
* Constraint 2: "(\\d+(\\s){2}(AddVB.*)+(\\s){2}===.*vector[A-Za-z]<B,32>)"
- Failed comparison: [found] 0 > 0 [given]
- No nodes matched!
* Constraint 3: "(\\d+(\\s){2}(StoreVector.*)+(\\s){2}===.*)"
- Failed comparison: [found] 0 > 0 [given]
- No nodes matched!
compiler/predicates/TestAssertionPredicateDoesntConstantFold.java
-XX:+UnlockDiagnosticVMOptions -XX:-TieredCompilation -XX:+StressArrayCopyMacroNode -XX:+StressLCM -XX:+StressGCM -XX:+StressIGVN -XX:+StressCCP -XX:+StressMacroExpansion -XX:+StressMethodHandleLinkerInlining -XX:+StressCompiledExceptionHandlers
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/workspace/open/src/hotspot/share/opto/loopopts.cpp:1739), pid=2300415, tid=2300431
# assert(!n->is_Store() && !n->is_LoadStore()) failed: no node with a side effect
#
# JRE version: Java(TM) SE Runtime Environment (24.0) (fastdebug build 24-internal-2024-10-23-1151312.tobias.hartmann.jdk4)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 24-internal-2024-10-23-1151312.tobias.hartmann.jdk4, compiled mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0x1300f90] PhaseIdealLoop::try_sink_out_of_loop(Node*) [clone .part.0]+0xbd0
Current CompileTask:
C2:157 14 b TestAssertionPredicateDoesntConstantFold::test (69 bytes)
Stack: [0x00007f502acee000,0x00007f502adee000], sp=0x00007f502ade8cf0, free space=1003k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x1300f90] PhaseIdealLoop::try_sink_out_of_loop(Node*) [clone .part.0]+0xbd0 (loopopts.cpp:1739)
V [libjvm.so+0x1301148] PhaseIdealLoop::split_if_with_blocks_post(Node*)+0x98 (loopopts.cpp:1706)
V [libjvm.so+0x13019fa] PhaseIdealLoop::split_if_with_blocks(VectorSet&, Node_Stack&)+0x9a (loopopts.cpp:1986)
V [libjvm.so+0x12f338b] PhaseIdealLoop::build_and_optimize()+0x112b (loopnode.cpp:5086)
V [libjvm.so+0xa36958] PhaseIdealLoop::optimize(PhaseIterGVN&, LoopOptsMode)+0x3a8 (loopnode.hpp:1129)
V [libjvm.so+0xa2f9a4] Compile::optimize_loops(PhaseIterGVN&, LoopOptsMode)+0x74 (compile.cpp:2179)
V [libjvm.so+0xa3072c] Compile::Optimize()+0xabc (compile.cpp:2426)
V [libjvm.so+0xa34683] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1b23 (compile.cpp:852)
V [libjvm.so+0x87ee45] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1d5 (c2compiler.cpp:142)
V [libjvm.so+0xa40518] CompileBroker::invoke_compiler_on_method(CompileTask*)+0x928 (compileBroker.cpp:2303)
V [libjvm.so+0xa411a8] CompileBroker::compiler_thread_loop()+0x478 (compileBroker.cpp:1961)
V [libjvm.so+0xef10fc] JavaThread::thread_main_inner()+0xcc (javaThread.cpp:759)
V [libjvm.so+0x181dad6] Thread::call_run()+0xb6 (thread.cpp:234)
V [libjvm.so+0x14ff5b8] thread_native_entry(Thread*)+0x128 (os_linux.cpp:858)
serviceability/sa/ClhsdbCDSCore.java
-Duse.JTREG_TEST_THREAD_FACTORY=Virtual -XX:+UseZGC -XX:-ZGenerational -XX:-VerifyContinuations
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007effa39fa616, pid=1722549, tid=1722585
#
# JRE version: Java(TM) SE Runtime Environment (24.0) (fastdebug build 24-internal-2024-10-23-1151312.tobias.hartmann.jdk4)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 24-internal-2024-10-23-1151312.tobias.hartmann.jdk4, mixed mode, sharing, tiered, compressed class ptrs, z gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0x1874616] Unsafe_PutInt+0x106
Stack: [0x00007efd6e325000,0x00007efd6e425000], sp=0x00007efd6e422c60, free space=1015k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x1874616] Unsafe_PutInt+0x106 (unsafe.cpp:251)
j jdk.internal.misc.Unsafe.putInt(Ljava/lang/Object;JI)V+0 java.base@24-internal
j jdk.internal.misc.Unsafe.putInt(JI)V+4 java.base@24-internal
j CrashApp.main([Ljava/lang/String;)V+5
j java.lang.invoke.LambdaForm$DMH+0x00007efd024123e0.invokeStatic(Ljava/lang/Object;Ljava/lang/Object;)V+10 java.base@24-internal
j java.lang.invoke.LambdaForm$MH+0x00007efd02414c10.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+33 java.base@24-internal
j java.lang.invoke.Invokers$Holder.invokeExact_MT(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+20 java.base@24-internal
j jdk.internal.reflect.DirectMethodHandleAccessor.invokeImpl(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+55 java.base@24-internal
j jdk.internal.reflect.DirectMethodHandleAccessor.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+23 java.base@24-internal
j java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+102 java.base@24-internal
j jdk.test.lib.process.ProcessTools.lambda$main$0(Ljava/lang/reflect/Method;[Ljava/lang/String;Ljdk/test/lib/process/ProcessTools$MainThreadGroup;)V+10
j jdk.test.lib.process.ProcessTools$$Lambda+0x00007efd070016e0.run()V+12
j java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@24-internal
j java.lang.VirtualThread.run(Ljava/lang/Runnable;)V+66 java.base@24-internal
j java.lang.VirtualThread$VThreadContinuation$1.run()V+8 java.base@24-internal
j jdk.internal.vm.Continuation.enter0()V+4 java.base@24-internal
j jdk.internal.vm.Continuation.enter(Ljdk/internal/vm/Continuation;Z)V+1 java.base@24-internal
J 96 jdk.internal.vm.Continuation.enterSpecial(Ljdk/internal/vm/Continuation;ZZ)V java.base@24-internal (0 bytes) @ 0x00007eff8c2d9ea4 [0x00007eff8c2d9d40+0x0000000000000164]
j jdk.internal.vm.Continuation.run()V+122 java.base@24-internal
j java.lang.VirtualThread.runContinuation()V+72 java.base@24-internal
j java.lang.VirtualThread$$Lambda+0x00007efd07047610.run()V+4 java.base@24-internal
j java.util.concurrent.ForkJoinTask$RunnableExecuteAction.compute()Ljava/lang/Void;+4 java.base@24-internal
[...]
compiler/c2/irTests/TestLongRangeChecks.java
-XX:UseAVX=0 -XX:UseSSE=2
Various IR verification failures
compiler/escapeAnalysis/TestMissingAntiDependency.java
-XX:StressLongCountedLoop=200000000
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/workspace/open/src/hotspot/share/opto/gcm.cpp:904), pid=2322548, tid=2322560
# assert(use_mem_state != load->find_exact_control(load->in(0))) failed: dependence cycle found
#
# JRE version: Java(TM) SE Runtime Environment (24.0) (fastdebug build 24-internal-2024-10-23-1149160.tobias.hartmann.jdk4)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 24-internal-2024-10-23-1149160.tobias.hartmann.jdk4, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, parallel gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0xdb3ffe] PhaseCFG::insert_anti_dependences(Block*, Node*, bool)+0x232e
#
Current CompileTask:
C2:285 92 b 4 TestMissingAntiDependency::test (89 bytes)
Stack: [0x00007f34579fb000,0x00007f3457afb000], sp=0x00007f3457af6690, free space=1005k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0xdb3ffe] PhaseCFG::insert_anti_dependences(Block*, Node*, bool)+0x232e (gcm.cpp:904)
V [libjvm.so+0xdba0b6] PhaseCFG::schedule_late(VectorSet&, Node_Stack&)+0xa16 (gcm.cpp:1521)
V [libjvm.so+0xdbaa0f] PhaseCFG::global_code_motion()+0x3ef (gcm.cpp:1632)
V [libjvm.so+0xdbd826] PhaseCFG::do_global_code_motion()+0x66 (gcm.cpp:1755)
V [libjvm.so+0xa318d3] Compile::Code_Gen()+0x3c3 (compile.cpp:2960)
V [libjvm.so+0xa347a0] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1c40 (compile.cpp:885)
V [libjvm.so+0x87ee45] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1d5 (c2compiler.cpp:142)
V [libjvm.so+0xa40518] CompileBroker::invoke_compiler_on_method(CompileTask*)+0x928 (compileBroker.cpp:2303)
V [libjvm.so+0xa411a8] CompileBroker::compiler_thread_loop()+0x478 (compileBroker.cpp:1961)
V [libjvm.so+0xef10fc] JavaThread::thread_main_inner()+0xcc (javaThread.cpp:759)
V [libjvm.so+0x181dad6] Thread::call_run()+0xb6 (thread.cpp:234)
V [libjvm.so+0x14ff5b8] thread_native_entry(Thread*)+0x128 (os_linux.cpp:858)
Thanks @TobiHartmann for the test results. For the failure in compiler/escapeAnalysis/TestMissingAntiDependency.java I already filed JDK-8341976. I will work on the other ones.
@rwestrel This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
I pushed an update that should fix all test failures except the one in compiler/escapeAnalysis/TestMissingAntiDependency.java (covered by JDK-8341976). A lot of them were caused by the following part of the change:
In the case of a long counted loop, the loop is transformed into a regular loop with a new limit and transformed range checks that's later turned into an in counted loop. The int counted loop doesn't need loop limit checks because of the way it's constructed. There's an assert that catches that we don't attempt to add one. I ran into test failures where, by the time the int counted loop is created, the fact that the number of iterations of the loop is small enough to not need a loop limit check gets lost. I added a cast to make sure the narrowed limit's type is not lost (I had to do something similar for loop nests). But then, I ran into the same issue again because the cast was pushed through a sub or add and the narrowed type was lost. I propose that pushing casts through sub/add be only done after loop opts are over (same as what's done for range check CastII).
So I removed that part of the initial change and instead added some logic to pattern match the CastLL used by the loop nest for which the transformation of (CastLL (AddL ...)) shouldn't be performed until the inner loop is turned into a counted loop.
@rwestrel Would you mind changing the title to something more descriptive of your change? I'm thinking: "C2: Don't create loop-nest for short running loops".
@rwestrel Would you mind changing the title to something more descriptive of your change? I'm thinking: "C2: Don't create loop-nest for short running loops".
Done.
Which benchmarks are you referring to?
The one mentioned in the bug: https://github.com/openjdk/jdk/compare/master...mcimadamore:jdk:manual_mismatch_bench?expand=1
@rwestrel it would be nice to see a plot like this, with the benchmark results: X-axis: increasing loop iterations Y-axis: time
Similar to what I did here: https://github.com/openjdk/jdk/pull/22070
You could go over loop sizes 500-2000 in steps of 100, just to get a rough sense if your constant threshold of 1000 is roughly right.
Maybe you can even extend the benchmark I wrote there, with MemorySegment cases. That would be useful also for the other efforts where we are working on short running loops:
JDK-8307084: C2: Vector atomic post loop is not executed for some small trip counts
JDK-8344085: C2 SuperWord: improve vectorization for small loop iteration count
I just linked these two issues with this RFE on JBS.
FTR, testing is all clean now.
I'm worried this solution will be susceptible to profile pollution, i.e. if a loop is called from 2 places, one with a large trip count and one with a small trip count, then the caller with a small trip count will suffer. As a result, in addition to this, making loop nests with 1 iteration of the outer loop cheaper will also be necessary. Profile pollution seems to be an issue that leads to JDK-8307084, too.
I'm worried this solution will be susceptible to profile pollution, i.e. if a loop is called from 2 places, one with a large trip count and one with a small trip count, then the caller with a small trip count will suffer. As a result, in addition to this, making loop nests with 1 iteration of the outer loop cheaper will also be necessary. Profile pollution seems to be an issue that leads to JDK-8307084, too.
With the current patch, a loop is short running if it has fewer than ShortLoopIter because restricting the number of iterations that much makes the outer strip mined loop go away as well. We could give up on that and consider that all loops that run for no more than roughly max_jint/max RC scale is short running. That would realistically cover a lot of loops (loops with millions of iterations that is most of them?) and would help mitigate that problem.
I'm also looking at ways to make range checks (once hoisted out of loop by predication) cheaper to compute as the current RC expressions end up being quite convoluted.
Do you have anything in mind?
I didn't conduct extensive testing as you did so I don't know the threshold below which the overhead from the loop nest is significant. I think when it is significant then the trip count should be very low and the loop body be very small. As a result, when the loop body is small enough, you can clone the loop and create the loop nest for one of them. As the other should have a very low trip count it should stay low in size. Am I correct?
As a result, when the loop body is small enough, you can clone the loop and create the loop nest for one of them.
So we could do multiversioning. I'm currently experimenting with runtime-checks for Aliasing Analysis in SuperWord, so I'll use both a trap and a multiversioning approach. Just FYI, in case we end up both doing this we should come up with a common way to do multiversioning.
what about this part of my comment:
consider that all loops that run for no more than roughly max_jint/max RC scale is short running. That would realistically cover a lot of loops (loops with millions of iterations that is most of them?)
Do we expect loops that run for millions of iterations (or use unusual scale factors in RC) to be that common? That would be much simpler than multiversioning.
Do we expect loops that run for millions of iterations (or use unusual scale factors in RC) to be that common? That would be much simpler than multiversioning.
I agree, that should be adequate and simpler.
@rwestrel This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
Performance of Maurizio's micro benchmark: red is without this patch, green is with the just updated version of the patch (which treats all loops that execute for less than max_jint/max RC scale as short running), blue is the previous version of the patch (short running loops run for less than ShortLoopIter iterations).
Previous patch performs slightly better than current one because restricting the number of iterations for a short running loop more also allows removing the outer strip mined loop but it has the drawback that the transformation may applicable to fewer loops.
Can you inject the iteration count into the created loop so that it can avoid strip mining?
Can you inject the iteration count into the created loop so that it can avoid strip mining?
Some loops that are not covered by this change would likely benefit from not being strip mined if possible. So I think it would be better to address that separately (and make sure it plays well with this change).
There are some failures with compiler/loopopts/superword/TestMemorySegment.java in the test runs. Given https://github.com/openjdk/jdk/pull/21926 makes some changes that affect that test, I think it's easier to ignore those failures for now and revisit once 21926 integrates. AFAICT TestMemorySegment.java will need some small adjustments then.
https://github.com/openjdk/jdk/pull/21926 is in now. Should I submit testing?
I tweaked compiler/loopopts/superword/TestMemorySegment.java: a couple more tests pass now if ShortRunningLongLoop is true.
#21926 is in now. Should I submit testing?
Yes, please.
compiler/escapeAnalysis/TestMissingAntiDependency.java fails on Windows x64 and Linux AArch64 with -XX:StressLongCountedLoop=200000000:
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (workspace\open\src\hotspot\share\opto\gcm.cpp:916), pid=35968, tid=34752
# assert(use_mem_state != load->find_exact_control(load->in(0))) failed: dependence cycle found
#
Current CompileTask:
C2:710 98 b 4 TestMissingAntiDependency::test (89 bytes)
Stack: [0x0000007bdcb00000,0x0000007bdcc00000], sp=0x0000007bdcbfbba0, free space=1006k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [jvm.dll+0x7d2910] PhaseCFG::insert_anti_dependences+0xe30 (gcm.cpp:916)
V [jvm.dll+0x7d591f] PhaseCFG::schedule_late+0x47f (gcm.cpp:1536)
V [jvm.dll+0x7d083e] PhaseCFG::global_code_motion+0x31e (gcm.cpp:1650)
V [jvm.dll+0x7cf2ad] PhaseCFG::do_global_code_motion+0x6d (gcm.cpp:1780)
V [jvm.dll+0x55746d] Compile::Code_Gen+0x19d (compile.cpp:2953)
V [jvm.dll+0x555ca0] Compile::Compile+0x11d0 (compile.cpp:882)
V [jvm.dll+0x45cfd9] C2Compiler::compile_method+0x179 (c2compiler.cpp:144)
V [jvm.dll+0x573a5a] CompileBroker::invoke_compiler_on_method+0x7aa (compileBroker.cpp:2317)
V [jvm.dll+0x570fab] CompileBroker::compiler_thread_loop+0x33b (compileBroker.cpp:1976)
V [jvm.dll+0x8ba602] JavaThread::thread_main_inner+0x282 (javaThread.cpp:777)
V [jvm.dll+0xfa95f4] Thread::call_run+0x1b4 (thread.cpp:236)
V [jvm.dll+0xd6ae91] thread_native_entry+0xe1 (os_windows.cpp:566)
C [ucrtbase.dll+0x2268a] (no source info available)
C [KERNEL32.DLL+0x17ac4] (no source info available)
C [ntdll.dll+0x5a8c1] (no source info available)
Maybe it's (related to) JDK-8341976?