jdk 8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays

Adding C1 intrinsic for primitive array clone invocations for aarch64 and x86 architectures.

The intrinsic includes a change to avoid zeroing the newly allocated array because its contents are copied over within the same intrinsic with arraycopy. This means that the performance of primitive array clone exceeds that of primitive array copy. As an example, here are the microbenchmark results on darwin/aarch64:

$ make test TEST="micro:java.lang.ArrayClone" MICRO="JAVA_OPTIONS=-XX:TieredStopAtLevel=1"
Benchmark                 (size)  Mode  Cnt    Score    Error  Units
ArrayClone.byteArraycopy       0  avgt   15    3.476 ?  0.018  ns/op
ArrayClone.byteArraycopy      10  avgt   15    3.740 ?  0.017  ns/op
ArrayClone.byteArraycopy     100  avgt   15    7.124 ?  0.010  ns/op
ArrayClone.byteArraycopy    1000  avgt   15   39.301 ?  0.106  ns/op
ArrayClone.byteClone           0  avgt   15    3.478 ?  0.008  ns/op
ArrayClone.byteClone          10  avgt   15    3.562 ?  0.007  ns/op
ArrayClone.byteClone         100  avgt   15    5.888 ?  0.206  ns/op
ArrayClone.byteClone        1000  avgt   15   25.762 ?  0.203  ns/op
ArrayClone.intArraycopy        0  avgt   15    3.199 ?  0.016  ns/op
ArrayClone.intArraycopy       10  avgt   15    4.521 ?  0.008  ns/op
ArrayClone.intArraycopy      100  avgt   15   17.429 ?  0.039  ns/op
ArrayClone.intArraycopy     1000  avgt   15  178.432 ?  0.777  ns/op
ArrayClone.intClone            0  avgt   15    3.406 ?  0.016  ns/op
ArrayClone.intClone           10  avgt   15    4.272 ?  0.006  ns/op
ArrayClone.intClone          100  avgt   15   13.110 ?  0.122  ns/op
ArrayClone.intClone         1000  avgt   15  113.196 ? 13.400  ns/op

It also includes an optimization to avoid instantiating the array copy stub in scenarios like this.

I run hotspot compiler tests successfully limiting them to C1 compilation darwin/aarch64, linux/x86_64 and linux/686. E.g.

$ make test TEST="hotspot_compiler" JTREG="JAVA_OPTIONS=-XX:TieredStopAtLevel=1"
...
   TEST                                              TOTAL  PASS  FAIL ERROR
   jtreg:test/hotspot/jtreg:hotspot_compiler          1234  1234     0     0

One question I had is what to do about non-primitive object arrays, see my question on the issue. @cl4es any thoughts?

Thanks @rwestrel for his help shaping this up :)

Progress

[x] Change must not contain extraneous whitespace
[x] Commit message must refer to an issue
[x] Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

JDK-8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays (Enhancement - P4)

Reviewers

Dean Long (@dean-long - Reviewer) ⚠️ Review applies to 3af9cd9e
Roland Westrelin (@rwestrel - Reviewer) ⚠️ Review applies to c3b7fa47

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/17667/head:pull/17667
$ git checkout pull/17667

Update a local copy of the PR:
$ git checkout pull/17667
$ git pull https://git.openjdk.org/jdk.git pull/17667/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 17667

View PR using the GUI difftool:
$ git pr show -t 17667

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/17667.diff

Webrev

Link to Webrev Comment

Feb 01 '24 05:02 galderz

:wave: Welcome back galder! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

Feb 01 '24 05:02 bridgekeeper[bot]

@galderz The following label will be automatically applied to this pull request:

hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

Feb 01 '24 05:02 openjdk[bot]

Webrevs

Feb 01 '24 05:02 mlbridge[bot]

This looks very much like what I had in mind: implement a C1 clone intrinsic by reusing existing arraycopy code as much as possible while not going through too much trouble to do so. While typically the benefits will only be to startup/warmup, the OpenJDK itself and many others do run short running processes with -XX:TieredStopAtLevel=1 where this might result in a significant overall speed-up. Nice work!

Sorry for erroneously referencing C2's ArrayCopyNode in the RFE summary. Feel free to rephrase this as "Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays"

Regarding your question on object arrays then I have no strong opinion except that if it's much more work then it's probably not worth it. It's less used than the primitive clones, e.g. java.util.Array.copyOfRange(T[], int, int, Class<? extends T[]>) does not use clone unlike its primitive array relatives. Recent changes to such APIs made use of primitive array clone more prominent, which in turn made the C1 cost difference between clone and arraycopy more apparent.

Feb 01 '24 11:02 cl4es

Thanks @cl4es for the feedback and the additional information regarding primitive vs object array cloning. Happy to leave it at primitives. I've amended the issue description.

Feb 05 '24 16:02 galderz

Rather than create a new Clone op that fuses NewTypeArray and ArrayCopy, but requires new platform-specific code, why can't we just build clone on top of the existing NewTypeArray and ArrayCopy? We just need the new flag to avoid zeroing.

Feb 07 '24 00:02 dean-long

@dean-long I hadn't considered that but I will definitely look into it. I see there's a c1 type for NewTypeArray but I don't see any for array copy. Do you have any pointers/examples on how to approach a solution like the one you suggest?

Feb 07 '24 18:02 galderz

@galderz, let me see if I can get it to work.

Feb 08 '24 01:02 dean-long

I think the right solution would be to add a line in GraphBuilder::build_graph_for_intrinsic for _clone, to append the IR for NewTypeArray and ArrayCopy as if we parsed newarray and arraycopy() from bytecodes. I'll see if I can get that working tomorrow.

Feb 08 '24 02:02 dean-long

I've added a couple of commits to fine tune the current approach (I will still explore the other option Dean mentioned):

The first commit makes sure the clone intrinsic handles deoptimizations appropiately. compiler/interpreter/Test6833129 was failing because of this.

I've also added a commit to add a null checks for the array being cloned. This also includes a jtreg test for the scenario in which the array was null after C1 compilation. This was resulting in a segv that was not reproduced by any other existing test.

Feb 08 '24 12:02 galderz

I've fixed all the tests with the solution with the current approach. I'm now looking into the alternative solution @dean-long proposed.

Feb 09 '24 05:02 galderz

@dean-long I think I've got something working but need to carry out tests on my side. It needs to be fine tuned but it does indeed look way simpler.

Feb 09 '24 10:02 galderz

I've just pushed a new solution implemented, as much as possible, at the graph level. I've reverted the previous solution to keep history linear.

I've run the hotspot compiler tests successfully on darwin/aarch64 and linux/x86:

==============================
Test summary
==============================
   TEST                                              TOTAL  PASS  FAIL ERROR
   jtreg:test/hotspot/jtreg:hotspot_compiler          1242  1242     0     0
==============================
TEST SUCCESS

Finished building target 'test' in configuration 'release-darwin-arm64'

==============================
Test summary
==============================
   TEST                                              TOTAL  PASS  FAIL ERROR
   jtreg:test/hotspot/jtreg:hotspot_compiler          1235  1235     0     0
==============================
TEST SUCCESS

Finished building target 'test' in configuration 'release-linux-x86_64'

The array clone microbenchmark performance numbers show the expected improvements in both setups:

Benchmark                 (size)  Mode  Cnt    Score   Error  Units
ArrayClone.byteArraycopy       0  avgt   15    3.467 ? 0.006  ns/op
ArrayClone.byteArraycopy      10  avgt   15    3.673 ? 0.047  ns/op
ArrayClone.byteArraycopy     100  avgt   15    7.255 ? 0.025  ns/op
ArrayClone.byteArraycopy    1000  avgt   15   39.119 ? 0.595  ns/op
ArrayClone.byteClone           0  avgt   15    2.996 ? 0.008  ns/op
ArrayClone.byteClone          10  avgt   15    3.117 ? 0.020  ns/op
ArrayClone.byteClone         100  avgt   15    5.487 ? 0.063  ns/op
ArrayClone.byteClone        1000  avgt   15   25.232 ? 0.445  ns/op
ArrayClone.intArraycopy        0  avgt   15    3.198 ? 0.029  ns/op
ArrayClone.intArraycopy       10  avgt   15    4.428 ? 0.065  ns/op
ArrayClone.intArraycopy      100  avgt   15   17.015 ? 0.150  ns/op
ArrayClone.intArraycopy     1000  avgt   15  176.464 ? 1.820  ns/op
ArrayClone.intClone            0  avgt   15    2.891 ? 0.019  ns/op
ArrayClone.intClone           10  avgt   15    3.794 ? 0.008  ns/op
ArrayClone.intClone          100  avgt   15   12.289 ? 0.084  ns/op
ArrayClone.intClone         1000  avgt   15  112.032 ? 3.787  ns/op
Finished running test 'micro:java.lang.ArrayClone'
Test report is stored in build/release-darwin-arm64/test-results/micro_java_lang_ArrayClone

Benchmark                 (size)  Mode  Cnt    Score    Error  Units
ArrayClone.byteArraycopy       0  avgt   15    7.209 ?  0.168  ns/op
ArrayClone.byteArraycopy      10  avgt   15    8.433 ?  0.154  ns/op
ArrayClone.byteArraycopy     100  avgt   15   13.175 ?  1.063  ns/op
ArrayClone.byteArraycopy    1000  avgt   15  143.002 ?  2.946  ns/op
ArrayClone.byteClone           0  avgt   15    6.909 ?  0.120  ns/op
ArrayClone.byteClone          10  avgt   15    8.129 ?  0.744  ns/op
ArrayClone.byteClone         100  avgt   15    9.196 ?  0.201  ns/op
ArrayClone.byteClone        1000  avgt   15   60.831 ?  1.147  ns/op
ArrayClone.intArraycopy        0  avgt   15    6.674 ?  0.137  ns/op
ArrayClone.intArraycopy       10  avgt   15    8.815 ?  0.176  ns/op
ArrayClone.intArraycopy      100  avgt   15   42.191 ?  1.376  ns/op
ArrayClone.intArraycopy     1000  avgt   15  553.157 ? 51.817  ns/op
ArrayClone.intClone            0  avgt   15    6.376 ?  0.130  ns/op
ArrayClone.intClone           10  avgt   15    7.690 ?  0.652  ns/op
ArrayClone.intClone          100  avgt   15   25.115 ?  0.483  ns/op
ArrayClone.intClone         1000  avgt   15  245.194 ? 17.418  ns/op
Finished running test 'micro:java.lang.ArrayClone'
Test report is stored in build/release-linux-x86_64/test-results/micro_java_lang_ArrayClone

Feb 12 '24 13:02 galderz

Some linux 686 tests failing due to frame map not being big enough (side effect of not instantiating ArrayCopyStub for clone like use cases). I've ported the fix from the old approach to the new making necessary adjustments. I ran hotspot compiler tests with only C1 successfully on linux/x86_64 and linux/686 with the fix. Let's see if CI agrees.

Feb 13 '24 11:02 galderz

/reviewers 2

Feb 16 '24 09:02 dean-long

@dean-long The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).

Feb 16 '24 09:02 openjdk[bot]

@dean-long any chance you could have another look at this?

Mar 04 '24 12:03 galderz

I'm looking at it again, and I'm trying to figure how we can minimize platform-specific changes. I'm hoping we can move some of the set_force_reexecute boiler-plate code into shared code. We probably don't need _force_reexecute in CodeEmitInfo anymore.

Mar 07 '24 03:03 dean-long

Your front-end changes require back-end changes, which are only implemented for x86 and aarch64. So you need a way to disable this for other platforms, or port the fix to all platforms. Minimizing the amount of platform-specific code required would also help.

I'm struggling to understand what it is you think is missing in the PR. I have added the following 2 sections in such a way that they only trigger in x86 and aarch64. See here and here, and as far as I understand it, that's enough to address your concerns. Please let me know if there is something I might have missed.

Mar 12 '24 08:03 galderz

IR expansion in append_alloc_array_copy() looks unconditional. What's going to happen on platforms with no back-end support?

Mar 12 '24 18:03 dean-long

@galderz This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays

Reviewed-by: dlong, roland

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 739 new commits pushed to the master branch:

c642f44bbe1e4cdbc23496a34ddaae30990ce7c0: 8329839: Cleanup ZPhysicalMemoryBacking trace logging
d04ac14bdbab4187d0be98b8471f90be8a14f649: 8332236: javac crashes with module imports and implicitly declared class
4e77cf881d031e5b0320915b3eabd7702e560291: 8330795: C2: assert((uint)type <= T_CONFLICT && _zero_type[type] != nullptr) failed: bad type with -XX:-UseCompressedClassPointers
7b4ba7f90ab9ea5e1070c79534c587dad17d1bdd: 8325932: Replace ATTRIBUTE_NORETURN with direct [[noreturn]]
0bb5ae645165b97527ecccf02308df6072c363d8: 8332248: (fc) java/nio/channels/FileChannel/BlockDeviceSize.java failed with RuntimeException
4d32c607a4b496bf2bb09e54167ecbbab5569a0c: 8322008: Exclude some CDS tests from running with -Xshare:off
e91492ab4333c61f39b50eb428fa932131a5b908: 8313674: (fc) java/nio/channels/FileChannel/BlockDeviceSize.java should test for more block devices
95a601316de06b4b0fbf6e3c7777be5d2a1ca978: 8332042: Move MEMFLAGS to its own include file
5a4415a6bddb25cbd5b87ff8ad1a06179c2e452e: 8331858: [nmt] VM.native_memory statistics should work in summary mode
4ba74475d44831c1fe49359458163cd1567e9619: 8326957: Implement JEP 474: ZGC: Generational Mode by Default
... and 729 more: https://git.openjdk.org/jdk/compare/eebcc2181fe27f6aa10559233c7c58882a146f56...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@dean-long, @lmesnik, @rwestrel) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

Mar 13 '24 20:03 openjdk[bot]

IR expansion in append_alloc_array_copy() looks unconditional. What's going to happen on platforms with no back-end support?

I might be wrong but the way I understood the code, I think other platforms will have no issue with that with the way the current code works:

The check I added in Compiler::is_intrinsic_supported() means that for clone calls in other platforms it would return false. If that returns false, AbstractCompiler::is_intrinsic_available will return false. Then this means that in GraphBuilder::try_inline_intrinsics is_available would be false, in which case the method will always return false and build_graph_for_intrinsic will not be called. GraphBuilder::append_alloc_array_copy is called from build_graph_for_intrinsic, so I don't see a danger of that being called for non-supported platforms.

Mar 14 '24 12:03 galderz

OK, I missed the is_intrinsic_supported change. The platform-specific changes should probably have a comment saying they are for clone support. Also, I was hoping there was a way to minimize platform-specific changes, maybe by handing the force_reexecute inheritance in state_for(), and putting the state in x->state() instead of x->state_before().

Mar 15 '24 18:03 dean-long

@galderz this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout topic.0131.c1-array-clone
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

Mar 18 '24 09:03 openjdk[bot]

OK, I missed the is_intrinsic_supported change. The platform-specific changes should probably have a comment saying they are for clone support.

That's fine for me, but the platform-specific changes are a bit spread around. Anywhere in particular you'd want the comment(s) to go in? Or shall the comments be added in all platform specific changes?

Also, I was hoping there was a way to minimize platform-specific changes, maybe by handing the force_reexecute inheritance in state_for()....

I'm not sure exactly what you mean, but let me try to explain the x->state_before() bit. So I guess you're referring to this piece of logic:

  CodeEmitInfo* info = nullptr;
  if (x->state_before() != nullptr && x->state_before()->force_reexecute()) {
    info = state_for(x, x->state_before());
    info->set_force_reexecute();
  } else {
    info = state_for(x, x->state());
  }

This code was added to deal with SEGV failures coming out of compiler/interpreter/Test6833129.clone_and_verify, which is configure to stress deoptimizations. In append_alloc_array_copy, I construct NewTypeArray and the array_copy Intrinsic both with state_before so that if any deoptimizations happen at either stage, re-execution happen from the point before clone was called.

I guess I can move the code up to LIRGenerator::state_for and if the intrinsic is either array copy or new type array, apply that logic, is that what you are after?

... putting the state in x->state() instead of x->state_before()

Hmmm, you mean this instead?

  CodeEmitInfo* info = nullptr;
  if (x->state_before() != nullptr && x->state_before()->force_reexecute()) {
    info = state_for(x, x->state());
    info->set_force_reexecute();
  } else {
    info = state_for(x, x->state());
  }

Granted that it could be tidied but not sure I understand how state can work here. Before this change both new type array and array copy called CodeEmitInfo* info = state_for(x, x->state()); which didn't seem to be enough to be able to go back to re-executing bytecode before clone was called. It seemed to leave things in a half-done state when deoptimizations happened while clone's intrinsic was doing either a new type array or array copy.

Mar 20 '24 09:03 galderz

I only wanted the comments around the boilerplate force_reexecute() logic, but if you are happy with my idea to move that logic into LIRGenerator::state_for then the comment could go there. If not, I may look at it in a follow-up RFE, because I would like to get rid of the force_reexecute() hack that I added and see if I can instead tie it to the use of state_before() or ValueStack::StateBefore.

Mar 21 '24 02:03 dean-long

I only wanted the comments around the boilerplate force_reexecute() logic, but if you are happy with my idea to move that logic into LIRGenerator::state_for then the comment could go there. If not, I may look at it in a follow-up RFE, because I would like to get rid of the force_reexecute() hack that I added and see if I can instead tie it to the use of state_before() or ValueStack::StateBefore.

It might be better handled as a follow-up. The reason for the boilerplate code to look the way it does is because the original code didn't set set_force_reexecute. So by shaping the code in that way, I was trying to limit the impact of my changes to the specific case of my clone intrinsic, where both state before and force reexecute are set for the new type array and array copy intrinsics.

Mar 25 '24 09:03 galderz

My suggested cleanup is here: https://github.com/dean-long/jdk/tree/pr/17667 Also, you'll need a 2nd review.

Apr 04 '24 22:04 dean-long

My patch still needs some work.

Apr 05 '24 02:04 dean-long

I think we could eventually relax the requirement that receiver_klass be loaded, at least for object arrays, but for simplicity my patch will follow the existing behavior.

Apr 05 '24 02:04 dean-long

jdk jdk copied to clipboard

8302850: Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays

Progress

Issue

Reviewers

Reviewing

Webrev

Webrevs

jdk
jdk copied to clipboard