jdk
jdk copied to clipboard
8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays
Adding C1 intrinsic for primitive array clone invocations for aarch64 and x86 architectures.
The intrinsic includes a change to avoid zeroing the newly allocated array because its contents are copied over within the same intrinsic with arraycopy. This means that the performance of primitive array clone exceeds that of primitive array copy. As an example, here are the microbenchmark results on darwin/aarch64:
$ make test TEST="micro:java.lang.ArrayClone" MICRO="JAVA_OPTIONS=-XX:TieredStopAtLevel=1"
Benchmark (size) Mode Cnt Score Error Units
ArrayClone.byteArraycopy 0 avgt 15 3.476 ? 0.018 ns/op
ArrayClone.byteArraycopy 10 avgt 15 3.740 ? 0.017 ns/op
ArrayClone.byteArraycopy 100 avgt 15 7.124 ? 0.010 ns/op
ArrayClone.byteArraycopy 1000 avgt 15 39.301 ? 0.106 ns/op
ArrayClone.byteClone 0 avgt 15 3.478 ? 0.008 ns/op
ArrayClone.byteClone 10 avgt 15 3.562 ? 0.007 ns/op
ArrayClone.byteClone 100 avgt 15 5.888 ? 0.206 ns/op
ArrayClone.byteClone 1000 avgt 15 25.762 ? 0.203 ns/op
ArrayClone.intArraycopy 0 avgt 15 3.199 ? 0.016 ns/op
ArrayClone.intArraycopy 10 avgt 15 4.521 ? 0.008 ns/op
ArrayClone.intArraycopy 100 avgt 15 17.429 ? 0.039 ns/op
ArrayClone.intArraycopy 1000 avgt 15 178.432 ? 0.777 ns/op
ArrayClone.intClone 0 avgt 15 3.406 ? 0.016 ns/op
ArrayClone.intClone 10 avgt 15 4.272 ? 0.006 ns/op
ArrayClone.intClone 100 avgt 15 13.110 ? 0.122 ns/op
ArrayClone.intClone 1000 avgt 15 113.196 ? 13.400 ns/op
It also includes an optimization to avoid instantiating the array copy stub in scenarios like this.
I run hotspot compiler tests successfully limiting them to C1 compilation darwin/aarch64, linux/x86_64 and linux/686. E.g.
$ make test TEST="hotspot_compiler" JTREG="JAVA_OPTIONS=-XX:TieredStopAtLevel=1"
...
TEST TOTAL PASS FAIL ERROR
jtreg:test/hotspot/jtreg:hotspot_compiler 1234 1234 0 0
One question I had is what to do about non-primitive object arrays, see my question on the issue. @cl4es any thoughts?
Thanks @rwestrel for his help shaping this up :)
Progress
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
- [x] Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)
Issue
- JDK-8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays (Enhancement - P4)
Reviewers
- Dean Long (@dean-long - Reviewer) ⚠️ Review applies to 3af9cd9e
- Roland Westrelin (@rwestrel - Reviewer) ⚠️ Review applies to c3b7fa47
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/17667/head:pull/17667
$ git checkout pull/17667
Update a local copy of the PR:
$ git checkout pull/17667
$ git pull https://git.openjdk.org/jdk.git pull/17667/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 17667
View PR using the GUI difftool:
$ git pr show -t 17667
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/17667.diff
Webrev
:wave: Welcome back galder! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@galderz The following label will be automatically applied to this pull request:
hotspot-compiler
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
Webrevs
- 16: Full - Incremental (09408587)
- 15: Full - Incremental (c3b7fa47)
- 14: Full - Incremental (a35cdd84)
- 13: Full - Incremental (306db745)
- 12: Full - Incremental (9376e9ec)
- 11: Full - Incremental (f1f6edd0)
- 10: Full - Incremental (595d1e99)
- 09: Full - Incremental (2d8854d0)
- 08: Full - Incremental (ad6c51bf)
- 07: Full - Incremental (3af9cd9e)
- 06: Full (b7ff1e6c)
- 05: Full (f7285799)
- 04: Full - Incremental (16346205)
- 03: Full - Incremental (b4db57cf)
- 02: Full - Incremental (c9215e7d)
- 01: Full - Incremental (2fa8602f)
- 00: Full (36185c4b)
This looks very much like what I had in mind: implement a C1 clone intrinsic by reusing existing arraycopy code as much as possible while not going through too much trouble to do so. While typically the benefits will only be to startup/warmup, the OpenJDK itself and many others do run short running processes with -XX:TieredStopAtLevel=1 where this might result in a significant overall speed-up. Nice work!
Sorry for erroneously referencing C2's ArrayCopyNode in the RFE summary. Feel free to rephrase this as "Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays"
Regarding your question on object arrays then I have no strong opinion except that if it's much more work then it's probably not worth it. It's less used than the primitive clones, e.g. java.util.Array.copyOfRange(T[], int, int, Class<? extends T[]>) does not use clone unlike its primitive array relatives. Recent changes to such APIs made use of primitive array clone more prominent, which in turn made the C1 cost difference between clone and arraycopy more apparent.
Thanks @cl4es for the feedback and the additional information regarding primitive vs object array cloning. Happy to leave it at primitives. I've amended the issue description.
Rather than create a new Clone op that fuses NewTypeArray and ArrayCopy, but requires new platform-specific code, why can't we just build clone on top of the existing NewTypeArray and ArrayCopy? We just need the new flag to avoid zeroing.
@dean-long I hadn't considered that but I will definitely look into it. I see there's a c1 type for NewTypeArray but I don't see any for array copy. Do you have any pointers/examples on how to approach a solution like the one you suggest?
@galderz, let me see if I can get it to work.
I think the right solution would be to add a line in GraphBuilder::build_graph_for_intrinsic for _clone, to append the IR for NewTypeArray and ArrayCopy as if we parsed newarray and arraycopy() from bytecodes. I'll see if I can get that working tomorrow.
I've added a couple of commits to fine tune the current approach (I will still explore the other option Dean mentioned):
The first commit makes sure the clone intrinsic handles deoptimizations appropiately. compiler/interpreter/Test6833129 was failing because of this.
I've also added a commit to add a null checks for the array being cloned. This also includes a jtreg test for the scenario in which the array was null after C1 compilation. This was resulting in a segv that was not reproduced by any other existing test.
I've fixed all the tests with the solution with the current approach. I'm now looking into the alternative solution @dean-long proposed.
@dean-long I think I've got something working but need to carry out tests on my side. It needs to be fine tuned but it does indeed look way simpler.
I've just pushed a new solution implemented, as much as possible, at the graph level. I've reverted the previous solution to keep history linear.
I've run the hotspot compiler tests successfully on darwin/aarch64 and linux/x86:
==============================
Test summary
==============================
TEST TOTAL PASS FAIL ERROR
jtreg:test/hotspot/jtreg:hotspot_compiler 1242 1242 0 0
==============================
TEST SUCCESS
Finished building target 'test' in configuration 'release-darwin-arm64'
==============================
Test summary
==============================
TEST TOTAL PASS FAIL ERROR
jtreg:test/hotspot/jtreg:hotspot_compiler 1235 1235 0 0
==============================
TEST SUCCESS
Finished building target 'test' in configuration 'release-linux-x86_64'
The array clone microbenchmark performance numbers show the expected improvements in both setups:
Benchmark (size) Mode Cnt Score Error Units
ArrayClone.byteArraycopy 0 avgt 15 3.467 ? 0.006 ns/op
ArrayClone.byteArraycopy 10 avgt 15 3.673 ? 0.047 ns/op
ArrayClone.byteArraycopy 100 avgt 15 7.255 ? 0.025 ns/op
ArrayClone.byteArraycopy 1000 avgt 15 39.119 ? 0.595 ns/op
ArrayClone.byteClone 0 avgt 15 2.996 ? 0.008 ns/op
ArrayClone.byteClone 10 avgt 15 3.117 ? 0.020 ns/op
ArrayClone.byteClone 100 avgt 15 5.487 ? 0.063 ns/op
ArrayClone.byteClone 1000 avgt 15 25.232 ? 0.445 ns/op
ArrayClone.intArraycopy 0 avgt 15 3.198 ? 0.029 ns/op
ArrayClone.intArraycopy 10 avgt 15 4.428 ? 0.065 ns/op
ArrayClone.intArraycopy 100 avgt 15 17.015 ? 0.150 ns/op
ArrayClone.intArraycopy 1000 avgt 15 176.464 ? 1.820 ns/op
ArrayClone.intClone 0 avgt 15 2.891 ? 0.019 ns/op
ArrayClone.intClone 10 avgt 15 3.794 ? 0.008 ns/op
ArrayClone.intClone 100 avgt 15 12.289 ? 0.084 ns/op
ArrayClone.intClone 1000 avgt 15 112.032 ? 3.787 ns/op
Finished running test 'micro:java.lang.ArrayClone'
Test report is stored in build/release-darwin-arm64/test-results/micro_java_lang_ArrayClone
Benchmark (size) Mode Cnt Score Error Units
ArrayClone.byteArraycopy 0 avgt 15 7.209 ? 0.168 ns/op
ArrayClone.byteArraycopy 10 avgt 15 8.433 ? 0.154 ns/op
ArrayClone.byteArraycopy 100 avgt 15 13.175 ? 1.063 ns/op
ArrayClone.byteArraycopy 1000 avgt 15 143.002 ? 2.946 ns/op
ArrayClone.byteClone 0 avgt 15 6.909 ? 0.120 ns/op
ArrayClone.byteClone 10 avgt 15 8.129 ? 0.744 ns/op
ArrayClone.byteClone 100 avgt 15 9.196 ? 0.201 ns/op
ArrayClone.byteClone 1000 avgt 15 60.831 ? 1.147 ns/op
ArrayClone.intArraycopy 0 avgt 15 6.674 ? 0.137 ns/op
ArrayClone.intArraycopy 10 avgt 15 8.815 ? 0.176 ns/op
ArrayClone.intArraycopy 100 avgt 15 42.191 ? 1.376 ns/op
ArrayClone.intArraycopy 1000 avgt 15 553.157 ? 51.817 ns/op
ArrayClone.intClone 0 avgt 15 6.376 ? 0.130 ns/op
ArrayClone.intClone 10 avgt 15 7.690 ? 0.652 ns/op
ArrayClone.intClone 100 avgt 15 25.115 ? 0.483 ns/op
ArrayClone.intClone 1000 avgt 15 245.194 ? 17.418 ns/op
Finished running test 'micro:java.lang.ArrayClone'
Test report is stored in build/release-linux-x86_64/test-results/micro_java_lang_ArrayClone
Some linux 686 tests failing due to frame map not being big enough (side effect of not instantiating ArrayCopyStub for clone like use cases). I've ported the fix from the old approach to the new making necessary adjustments. I ran hotspot compiler tests with only C1 successfully on linux/x86_64 and linux/686 with the fix. Let's see if CI agrees.
/reviewers 2
@dean-long The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).
@dean-long any chance you could have another look at this?
I'm looking at it again, and I'm trying to figure how we can minimize platform-specific changes. I'm hoping we can move some of the set_force_reexecute boiler-plate code into shared code. We probably don't need _force_reexecute in CodeEmitInfo anymore.
Your front-end changes require back-end changes, which are only implemented for x86 and aarch64. So you need a way to disable this for other platforms, or port the fix to all platforms. Minimizing the amount of platform-specific code required would also help.
I'm struggling to understand what it is you think is missing in the PR. I have added the following 2 sections in such a way that they only trigger in x86 and aarch64. See here and here, and as far as I understand it, that's enough to address your concerns. Please let me know if there is something I might have missed.
IR expansion in append_alloc_array_copy() looks unconditional. What's going to happen on platforms with no back-end support?
@galderz This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays
Reviewed-by: dlong, roland
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 739 new commits pushed to the master branch:
- c642f44bbe1e4cdbc23496a34ddaae30990ce7c0: 8329839: Cleanup ZPhysicalMemoryBacking trace logging
- d04ac14bdbab4187d0be98b8471f90be8a14f649: 8332236: javac crashes with module imports and implicitly declared class
- 4e77cf881d031e5b0320915b3eabd7702e560291: 8330795: C2: assert((uint)type <= T_CONFLICT && _zero_type[type] != nullptr) failed: bad type with -XX:-UseCompressedClassPointers
- 7b4ba7f90ab9ea5e1070c79534c587dad17d1bdd: 8325932: Replace ATTRIBUTE_NORETURN with direct [[noreturn]]
- 0bb5ae645165b97527ecccf02308df6072c363d8: 8332248: (fc) java/nio/channels/FileChannel/BlockDeviceSize.java failed with RuntimeException
- 4d32c607a4b496bf2bb09e54167ecbbab5569a0c: 8322008: Exclude some CDS tests from running with -Xshare:off
- e91492ab4333c61f39b50eb428fa932131a5b908: 8313674: (fc) java/nio/channels/FileChannel/BlockDeviceSize.java should test for more block devices
- 95a601316de06b4b0fbf6e3c7777be5d2a1ca978: 8332042: Move MEMFLAGS to its own include file
- 5a4415a6bddb25cbd5b87ff8ad1a06179c2e452e: 8331858: [nmt] VM.native_memory statistics should work in summary mode
- 4ba74475d44831c1fe49359458163cd1567e9619: 8326957: Implement JEP 474: ZGC: Generational Mode by Default
- ... and 729 more: https://git.openjdk.org/jdk/compare/eebcc2181fe27f6aa10559233c7c58882a146f56...master
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@dean-long, @lmesnik, @rwestrel) but any other Committer may sponsor as well.
➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).
IR expansion in append_alloc_array_copy() looks unconditional. What's going to happen on platforms with no back-end support?
I might be wrong but the way I understood the code, I think other platforms will have no issue with that with the way the current code works:
The check I added in Compiler::is_intrinsic_supported() means that for clone calls in other platforms it would return false. If that returns false, AbstractCompiler::is_intrinsic_available will return false. Then this means that in GraphBuilder::try_inline_intrinsics is_available would be false, in which case the method will always return false and build_graph_for_intrinsic will not be called. GraphBuilder::append_alloc_array_copy is called from build_graph_for_intrinsic, so I don't see a danger of that being called for non-supported platforms.
OK, I missed the is_intrinsic_supported change. The platform-specific changes should probably have a comment saying they are for clone support. Also, I was hoping there was a way to minimize platform-specific changes, maybe by handing the force_reexecute inheritance in state_for(), and putting the state in x->state() instead of x->state_before().
@galderz this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:
git checkout topic.0131.c1-array-clone
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
OK, I missed the is_intrinsic_supported change. The platform-specific changes should probably have a comment saying they are for clone support.
That's fine for me, but the platform-specific changes are a bit spread around. Anywhere in particular you'd want the comment(s) to go in? Or shall the comments be added in all platform specific changes?
Also, I was hoping there was a way to minimize platform-specific changes, maybe by handing the force_reexecute inheritance in state_for()....
I'm not sure exactly what you mean, but let me try to explain the x->state_before() bit. So I guess you're referring to this piece of logic:
CodeEmitInfo* info = nullptr;
if (x->state_before() != nullptr && x->state_before()->force_reexecute()) {
info = state_for(x, x->state_before());
info->set_force_reexecute();
} else {
info = state_for(x, x->state());
}
This code was added to deal with SEGV failures coming out of compiler/interpreter/Test6833129.clone_and_verify, which is configure to stress deoptimizations. In append_alloc_array_copy, I construct NewTypeArray and the array_copy Intrinsic both with state_before so that if any deoptimizations happen at either stage, re-execution happen from the point before clone was called.
I guess I can move the code up to LIRGenerator::state_for and if the intrinsic is either array copy or new type array, apply that logic, is that what you are after?
... putting the state in x->state() instead of x->state_before()
Hmmm, you mean this instead?
CodeEmitInfo* info = nullptr;
if (x->state_before() != nullptr && x->state_before()->force_reexecute()) {
info = state_for(x, x->state());
info->set_force_reexecute();
} else {
info = state_for(x, x->state());
}
Granted that it could be tidied but not sure I understand how state can work here. Before this change both new type array and array copy called CodeEmitInfo* info = state_for(x, x->state()); which didn't seem to be enough to be able to go back to re-executing bytecode before clone was called. It seemed to leave things in a half-done state when deoptimizations happened while clone's intrinsic was doing either a new type array or array copy.
I only wanted the comments around the boilerplate force_reexecute() logic, but if you are happy with my idea to move that logic into LIRGenerator::state_for then the comment could go there. If not, I may look at it in a follow-up RFE, because I would like to get rid of the force_reexecute() hack that I added and see if I can instead tie it to the use of state_before() or ValueStack::StateBefore.
I only wanted the comments around the boilerplate force_reexecute() logic, but if you are happy with my idea to move that logic into LIRGenerator::state_for then the comment could go there. If not, I may look at it in a follow-up RFE, because I would like to get rid of the force_reexecute() hack that I added and see if I can instead tie it to the use of state_before() or ValueStack::StateBefore.
It might be better handled as a follow-up. The reason for the boilerplate code to look the way it does is because the original code didn't set set_force_reexecute. So by shaping the code in that way, I was trying to limit the impact of my changes to the specific case of my clone intrinsic, where both state before and force reexecute are set for the new type array and array copy intrinsics.
My suggested cleanup is here: https://github.com/dean-long/jdk/tree/pr/17667 Also, you'll need a 2nd review.
My patch still needs some work.
I think we could eventually relax the requirement that receiver_klass be loaded, at least for object arrays, but for simplicity my patch will follow the existing behavior.