jdk icon indicating copy to clipboard operation
jdk copied to clipboard

8289943: Simplify some object allocation merges

Open JohnTortugo opened this issue 2 years ago • 43 comments

Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge?

The basic idea for this approach was discussed in this thread and it consists of:

  1. Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node).
  2. Scalar Replace the incoming allocations to the RAM node.
  3. Scalar Replace the RAM node itself.

There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs:

  • The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN.

These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints:

  • The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields.
  • The way I check if there is an incoming Allocate node to the original Phi node.
  • The way I check if there is no store to the merged objects after they are merged.

Testing:

  • Windows/Linux/MAC fastdebug/release
    • hotspot_all
    • tier1
    • Renaissance
    • dacapo
    • new IR-based tests

Progress

  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue
  • [ ] Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9073/head:pull/9073
$ git checkout pull/9073

Update a local copy of the PR:
$ git checkout pull/9073
$ git pull https://git.openjdk.org/jdk pull/9073/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 9073

View PR using the GUI difftool:
$ git pr show -t 9073

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9073.diff

JohnTortugo avatar Jun 07 '22 23:06 JohnTortugo

:wave: Welcome back JohnTortugo! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Jun 07 '22 23:06 bridgekeeper[bot]

@JohnTortugo this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout allocation-merges
git fetch https://git.openjdk.java.net/jdk master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

openjdk[bot] avatar Jun 07 '22 23:06 openjdk[bot]

⚠️ @JohnTortugo This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request in integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).

openjdk[bot] avatar Jun 07 '22 23:06 openjdk[bot]

@JohnTortugo The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Jun 07 '22 23:06 openjdk[bot]

Hi there, can someone please take a look and let me know if this is going in a reasonable direction?

@vnkozlov - is this more or less on the lines of what you were thinking?

JohnTortugo avatar Jun 22 '22 20:06 JohnTortugo

This is good starting point. To have new "Phi" type node to collect information about merged allocation. I need more time to dive into changes to give review.

Thanks for taking the time to look into this!

I currently found one issue we need discuss - merge allocation of different subclasses (of the same parent class) which may have different number of fields. Current implementation assume objects are the same but I don't see the check for it during RAM node creation. May be we should have it at this initial implementation.

Got it. I'll create some tests for this and see what happens.

What about adjust_scalar_replaceable_state() code mark allocation as non-SR if they are merged?

I didn't get this part. Can you please clarify?

About input memory slices. Since merged allocation are SR we should have some new memory Phi created in EA split_memory_phi() which we can try to identify instead of adding all memory slices we find (I am talking about RAM constructor).

I'll take a look into that. Thanks for the suggestion.

I see you bailed compilation to recompile in case you can't remove RAM node. I think it is fine for initial implementation.

Great.

JohnTortugo avatar Jul 11 '22 19:07 JohnTortugo

What about adjust_scalar_replaceable_state() code mark allocation as non-SR if they are merged?

I didn't get this part. Can you please clarify?

My bad. After looking more on changes I noticed that you exit compute_escape() before adjust_scalar_replaceable_state() is called. So my comment is null.

vnkozlov avatar Jul 11 '22 23:07 vnkozlov

After looking more on changes I noticed that you exit compute_escape() before adjust_scalar_replaceable_state() is called.

No worries. I'm currently working to make reduce_allocation_merges be executed as part of compute_escape so that we can take benefit from iterative EA executions.

JohnTortugo avatar Jul 12 '22 00:07 JohnTortugo

I run into an error

$make test TEST=compiler/c2/irTests/scalarReplacement/AllocationMergesTests.java CONF=linux-x86_64-server-fastdebug

One or more @IR rules failed:

Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "int compiler.c2.irTests.scalarReplacement.AllocationMergesTests.testPollutedPolymorphic(boolean,int)" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(applyIf={}, applyIfAnd={}, failOn={}, applyIfOr={}, counts={"(.*precise .*\\R((.*(?i:mov|xorl|nop|spill).*|\\s*|.*LGHI.*)\\R)*.*(?i:call,static).*wrapper for: _new_instance_Java)", "2"}, applyIfNot={})"
     - counts: Graph contains wrong number of nodes:
       * Regex 1: (.*precise .*\R((.*(?i:mov|xorl|nop|spill).*|\s*|.*LGHI.*)\R)*.*(?i:call,static).*wrapper for: _new_instance_Java)
         - Failed comparison: [found] 0 = 2 [given]
         - No nodes matched!

>>> Check stdout for compilation output of the failed methods


  #############################################################
   - To only run the failed tests use -DTest, -DExclude,
     and/or -DScenarios.
   - To also get the standard output of the test VM run with
     -DReportStdout=true or for even more fine-grained logging
     use -DVerbose=true.
  #############################################################


compiler.lib.ir_framework.driver.irmatching.IRViolationException: There were one or multiple IR rule failures. Please check stderr for more information.
        at compiler.lib.ir_framework.driver.irmatching.IRMatcher.throwIfNoSafepointWhilePrinting(IRMatcher.java:91)
        at compiler.lib.ir_framework.driver.irmatching.IRMatcher.reportFailures(IRMatcher.java:82)
        at compiler.lib.ir_framework.driver.irmatching.IRMatcher.applyIRRules(IRMatcher.java:54)
        at compiler.lib.ir_framework.driver.irmatching.IRMatcher.<init>(IRMatcher.java:43)
        at compiler.lib.ir_framework.TestFramework.runTestVM(TestFramework.java:729)
        at compiler.lib.ir_framework.TestFramework.start(TestFramework.java:698)
        at compiler.lib.ir_framework.TestFramework.start(TestFramework.java:329)
        at compiler.lib.ir_framework.TestFramework.runWithFlags(TestFramework.java:237)
        at compiler.c2.irTests.scalarReplacement.AllocationMergesTests.main(AllocationMergesTests.java:37)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
        at java.base/java.lang.reflect.Method.invoke(Method.java:578)
        at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:312)
        at java.base/java.lang.Thread.run(Thread.java:1596)

2 objects do get scalarized in testPollutedPolymorphic.

navyxliu avatar Jul 14 '22 23:07 navyxliu

Thanks for taking a look @navyxliu . I'm working on an improvement (and fixing a bug) and I'll include your suggestions in the next push.

JohnTortugo avatar Jul 15 '22 23:07 JohnTortugo

@JohnTortugo This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar Aug 13 '22 00:08 bridgekeeper[bot]

Hi @vnkozlov , @navyxliu - I think I addressed all your comments so far. Can I ask you to please take a look again? I made several modifications for fixing bugs and improving the coverage of the optimization. In this new version the simplification of allocations is performed in every EA iteration, the memory edges in RAM are updated after split_unique_types is computed and only allocations merged inputs of the same _klass are simplified.

I tested this in several benchmarks and didn't find any regression. Nevertheless, I'm still going to run more tests with different JVM configurations.

Can I ask you (@TobiHartmann , @rwestrel , @iwanowww) to also take a look?

JohnTortugo avatar Aug 22 '22 20:08 JohnTortugo

I currently found one issue we need discuss - merge allocation of different subclasses (of the same parent class) which may have different number of fields. Current implementation assume objects are the same but I don't see the check for it during RAM node creation. May be we should have it at this initial implementation.

Hi again @vnkozlov - The current implementation requires the Klass of all inputs to the Phi to be the same. I'm trying to lift that restriction as it would increase the number of cases where the merge can be removed. I've been trying to create an example where I'd need to enforce the allocations to be of the same type but I couldn't figure one out yet. Do you have an example where that must be enforced? Note that I only need fields accessed through the merge Phi node, which means only fields present in the superclass AFAIU.

JohnTortugo avatar Aug 31 '22 20:08 JohnTortugo

Hi @JohnTortugo Yes, I think you can lift restriction for all inputs have the same class. I looked more and you correctly set RAM's type to original Phi's type - it should be correct type (base class, which could be also one of inputs). You correctly register fields offsets based on this (Phi) class during RAM make. C2 type system and bytecode should guarantee that Phi's type is correct (meet of all inputs). It could be more narrowed during CCP phase but type you got serves correctly for this implementation. May be assert that it is not interface - or skip such cases if you find them.

You can add assert(input_t->is_instptr()->instance_klass()->is_subtype_of(klass)) instead of klass equals check.

What confused me is code in PhaseMacroExpand::scalar_replacement() where you process Allocation from one of inputs which may have more fields than RAM registered. But you correctly have !ram->needs_field(offset) check which I missed before. That why I commented.

vnkozlov avatar Aug 31 '22 23:08 vnkozlov

Allocations in testPollutedPolymorphic() are removed because both classes have the same Shape class which have all fields. Would be interesting if l field is declared only in both subclasses.

vnkozlov avatar Sep 01 '22 00:09 vnkozlov

@vnkozlov - Thank you for clarifying that. I've been playing with lifting the restriction and I actually found a corner case:

public static Class test(boolean c1, boolean c2, boolean c3, int x, int y, int w, int z) {
    Animal s = new Dog(x, y, z);

    if (c1) {
        s = new Cat("Fisker");
    }

    Unloaded u = new Unloaded(); // assumes this is converted to a uncommon_trap(unloaded, reinterpret)

    return s.getClass();
}

It seems that when merging allocations of different subtypes I'll need to add a special Phi node merging the Klass of the input allocations and assign the output of that Phi to SafepointScalarReplacedNode. If I don't do that, the method above will return Animal.class instead of Dog.class or Cat.class. I'm wondering if I'll actually have to do the same for the Header/Mark word of the input allocations.

JohnTortugo avatar Sep 02 '22 00:09 JohnTortugo

@JohnTortugo Yes, you would need to construct Phi when you replace RAM with SafePointScalarObjectNode. Hmm, may be you would need to construct Phi in other cases too (getClass intrinsic).

Add cases when class is loaded from argument for allocation: Unsafe.allocateInstance() and Object.clone() to test class Load instead of simple constant on some paths of such Phi.

Why you need Phi for mark word? For identity check (hashCode/identityHashCode intrinsics)?

vnkozlov avatar Sep 02 '22 16:09 vnkozlov

@JohnTortugo Yes, you would need to construct Phi when you replace RAM with SafePointScalarObjectNode. Hmm, may be you would need to construct Phi in other cases too (getClass intrinsic).

Yes. I'll take a look into getClass intrinsic. I thought that just adding input to SafePointScalarObjectNode+Safepoint with a Phi of the input allocations Klass fields would be enough for the code to correctly access fields/methods of the input objects. However, it looks like the Klass of the object serialized/deserialized from SafePointScalarObjectNode is the type of SafePointScalarObjectNode, not the Klass field input edge set in the Safepoint. In other words, the object represented in SafePointScalarObjectNode needs to be exact AFAIU.

Add cases when class is loaded from argument for allocation: Unsafe.allocateInstance() and Object.clone() to test class Load instead of simple constant on some paths of such Phi.

I'll do that. Thanks for the tip.

Why you need Phi for mark word? For identity check (hashCode/identityHashCode intrinsics)?

Yes, I was wondering about a case where I'd need any of the information present in the Mark word. The code in the PR is able to solve merges where only some of the merged allocations are removed. I believe I'll need, at least, to preserve the Mark word of the allocations not removed.

JohnTortugo avatar Sep 02 '22 20:09 JohnTortugo

Hi again @vnkozlov, I ended up giving up on adding support for merging allocations of different subclasses. I think the current version of the patch already covers some reasonable ground and it seems that adding support for merging allocations of subclasses will require more widespread changes.

Also, I wasn't able to find a case where the patch fail due to an incorrect/missing Mark word. FYI we have been testing the current patch with all JTREG tests, TCK tests, SPECJBB, Renaissance, DaCapo, and some customer services. I currently don't see any failure.

I also added a few JMH tests to show up what kind of improvements we can expect with the proposed changes:

Benchmark                                                         Mode  Cnt         Score        Error  Units

AllocationsMerge.IfElseInLoop                                     avgt    5  93,839,709.289 ± 16846405.548  ns/op
AllocationsMerge.MergeAndIterative                                avgt    5     160,988.682 ±    57389.599  ns/op
AllocationsMerge.NestedObjectsObject                              avgt    5     311,523.642 ±    59622.703  ns/op
AllocationsMerge.SimpleMerge                                      avgt    5     142,398.057 ±    57010.144  ns/op
AllocationsMerge.TrapAfterMerge                                   avgt    5     333,061.738 ±    25645.896  ns/op

AllocationsMerge.WithAllocationsMergeEnabled.IfElseInLoop         avgt    5  11,964,798.596 ±  1661843.496  ns/op
AllocationsMerge.WithAllocationsMergeEnabled.MergeAndIterative    avgt    5      33,118.428 ±     2995.224  ns/op
AllocationsMerge.WithAllocationsMergeEnabled.NestedObjectsObject  avgt    5     155,808.370 ±    30379.710  ns/op
AllocationsMerge.WithAllocationsMergeEnabled.SimpleMerge          avgt    5      31,266.188 ±    13476.634  ns/op
AllocationsMerge.WithAllocationsMergeEnabled.TrapAfterMerge       avgt    5     208,694.038 ±    18667.023  ns/op

Please let me know if there is something else that I can do to make it easier to review the proposed changes.

JohnTortugo avatar Sep 20 '22 18:09 JohnTortugo

Thank you for taking a look @vnkozlov . I'll address your comments asap.

JohnTortugo avatar Sep 21 '22 01:09 JohnTortugo

Hi @vnkozlov - I pushed some changes that I think address the comments that you made. I also had to add another constraint to reducing the allocations: for now, I won't be reducing merges of SR and NSR allocations. I'll definitely work on that in a subsequent patch.

Thank you again for reviewing and I'd appreciate it very much if you could check if the patch doesn't break any of your tests.

JohnTortugo avatar Sep 28 '22 00:09 JohnTortugo

@JohnTortugo Good news is that I ran from tier1 to tier4 and with ReduceAllocationMerges enabled and I did not get new failures except in your new test compiler/c2/irTests/scalarReplacement/AllocationMergesTests.java when it run with -ea -esa -XX:CompileThreshold=100 -XX:-TieredCompilation flags.

I put failed test's output to RFE.

vnkozlov avatar Sep 28 '22 18:09 vnkozlov

/reviewers 2

vnkozlov avatar Sep 28 '22 18:09 vnkozlov

We need others to look on this too at this stage.

vnkozlov avatar Sep 28 '22 18:09 vnkozlov

@vnkozlov The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).

openjdk[bot] avatar Sep 28 '22 18:09 openjdk[bot]

And I will submit performance runs.

vnkozlov avatar Sep 28 '22 18:09 vnkozlov

Thanks a lot for the comments @vnkozlov . I'll push new changes addressing your suggestions later today.

JohnTortugo avatar Sep 28 '22 18:09 JohnTortugo

And I will submit performance runs.

Performance results are mixed bag. Saw some improvements on x64 in Renaissance but they gone after rerun. I see some significant regression in Dacapo on Linux Aarch64 (Ampere CPU) but don't see it at all on MacOS M1.

vnkozlov avatar Sep 30 '22 17:09 vnkozlov

Hi @vnkozlov & @merykitty - thank you for the reviews. I'm working to address your comments. I had to take some time off of work due to personal issues but I believe I can address your comments during the weekend or earlier next week.

@vnkozlov - Thank you so much for running the perf. tests! Can you please give me more details about the experiment? What metric did you analyze, which version of Dacapo, and was it any specific benchmark that regressed or all of them? Which JVM parameters?

Regards, Cesar

JohnTortugo avatar Sep 30 '22 22:09 JohnTortugo