jdk 8329203: Parallel: Investigate Mark-Compact for Full GC to decrease memory usage

Refactor Parallel full-gc to use the same algorithm (mark-compact) as Serial and G1 full-GC. This removes the obj-end bitmap. When GC threads are few, the old implementation can be more efficient because it requires fewer heap iterations. The new full-GC implementation, on the other hand, is more scalable because it introduces more phases (forward_to_new_addr and adjust_pointers) that can partition work effectively.

The diff is rather large, so reading the new code directly from invoke_no_policy is probably easier.

Test: tier1-6; some improvement in Dacapo-h2, CacheStresser, but no difference in specjbb2015, specjvm2008.

Progress

[x] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
[x] Change must not contain extraneous whitespace
[x] Commit message must refer to an issue

Issue

JDK-8329203: Parallel: Investigate Mark-Compact for Full GC to decrease memory usage (Enhancement - P4)

Reviewers

Roman Kennke (@rkennke - Reviewer) ⚠️ Review applies to 51c2b07a

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19101/head:pull/19101
$ git checkout pull/19101

Update a local copy of the PR:
$ git checkout pull/19101
$ git pull https://git.openjdk.org/jdk.git pull/19101/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 19101

View PR using the GUI difftool:
$ git pr show -t 19101

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19101.diff

Webrev

Link to Webrev Comment

May 06 '24 10:05 albertnetymk

:wave: Welcome back ayang! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

May 06 '24 10:05 bridgekeeper[bot]

@albertnetymk This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8329203: Parallel: Investigate Mark-Compact for Full GC to decrease memory usage

Reviewed-by: rkennke, gli

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 2 new commits pushed to the master branch:

ae9ad862ee54e119553efec919f1061dca36b954: 8331934: [s390x] Add support for primitive array C1 clone intrinsic
3479b46c5bea3afd92b6ab4acd2fe7f274df38aa: 8332595: Serial: Remove unused TenuredGeneration::should_collect

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch. As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

May 06 '24 10:05 openjdk[bot]

@albertnetymk The following label will be automatically applied to this pull request:

hotspot-gc

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

May 06 '24 10:05 openjdk[bot]

Webrevs

May 06 '24 10:05 mlbridge[bot]

Thank you, that's a great change. I actually intended to do something similar soon, because the existing Parallel Full GC would not be compatible with some stuff that I need for Lilliput 2 (namely, the Compact Identity Hashcode).

Before I go into reviewing, could you describe a bit deeper how the implementation works? In particular, G1 (and Shenandoah) parallelize work by assigning regions to GC workers. (It does not matter for Serial, obviously.) Parallel GC has no such conceps of regions. How does your implementation parallelize the work? (My idea was to divide into regions for the purpose of the full-GC and deal with overlapping objects somehow, maybe by using a block-offset-table. But I haven't spent too much thought on it, so far.)

May 06 '24 11:05 rkennke

Parallel GC has no such conceps of regions.

Parallel full-gc does use "regions", ParallelCompactData::RegionData.

My idea was to divide into regions for the purpose of the full-GC and deal with overlapping objects somehow, maybe by using a block-offset-table. But I haven't spent too much thought on it, so far.

That's how existing full-gc works (e.g. partial_obj_addr for overlapping objs) and the new implementation doesn't touch/change that part. The most significant change is doing new-addr-calculation and remapping before compaction; hence, these two new phases.

May 06 '24 11:05 albertnetymk

Ok, I see. That sounds reasonable.

Have you also run the micros test/micro/org/openjdk/bench/vm/gc/systemgc/ on it?

May 06 '24 13:05 rkennke

Ok then. I tried to fit Lilliput's SlidingForwarding and works fine. I also tried to fit Lilliput2's Compact Identity Hashcode and this does not yet quite work. The problem is that we collect liveness data per region during marking, and use that as a basis to calculate the destination address of a region. However, with compact i-hash, the size of a copied object might be one word larger, which means that this calculation would be off. In essence, we don't know during marking, which size an object may have when copied. We only know that once we calculate the forwarding addresses, and it depends on the hash-bits and whether or not an object actually moves or not. Can you think of a way to not rely on marking liveness info for calculating the destination address for each region? Check out if you like: https://github.com/rkennke/lilliput/tree/lilliput-2-prototype

May 08 '24 10:05 rkennke

During forward_to_new_addr(), we give each worker a contiguous set of regions to work on.

Ok then. I tried to fit Lilliput's SlidingForwarding and works fine. I also tried to fit Lilliput2's Compact Identity Hashcode and this does not yet quite work. The problem is that we collect liveness data per region during marking, and use that as a basis to calculate the destination address of a region. However, with compact i-hash, the size of a copied object might be one word larger, which means that this calculation would be off. In essence, we don't know during marking, which size an object may have when copied. We only know that once we calculate the forwarding addresses, and it depends on the hash-bits and whether or not an object actually moves or not. Can you think of a way to not rely on marking liveness info for calculating the destination address for each region? Check out if you like: https://github.com/rkennke/lilliput/tree/lilliput-2-prototype

I guess a naive way to do that would be to use a global bump-ptr during forwarding phase, that each worker bumps up atomically, instead of relying on the pre-summarized structure. This would quickly become a scalability bottleneck, of course. So, instead of doing that, workers could claim region-sized chunks (or larger, when larger objects need to be forwarded) atomically, and bump up inside those regions thread-locally. The worker would also have to record (in the region-data) the new-top for the region. That waste-gap has to be filled later, during or after compaction. When a worker claims a larger-than-region chunk, then the next claimed (smaller-than-region) chunk would be smaller, so that its end aligns with region-boundary - this makes keeping track of the gaps easier.

I think using this approach, we could eliminate a whole lot of other stuff, the summarization and keeping track of liveness during marking might all no longer be needed. It might also be more efficient because of this. OTOH, this leads to some waste at end of regions, which could probably be minimized using a serial phase similar to what G1 does.

If you agree this is a sane approach, I'll try that for my Lilliput2 prototype - I don't think there is a need to have this in upstream just yet. Unless you want to have it. ;-)

May 08 '24 12:05 rkennke

If you agree this is a sane approach

I think it's worth exploring. Should not be part of this PR though. Feel free to try that for Lilliput2.

May 08 '24 12:05 albertnetymk

Can you check if this fixes https://bugs.openjdk.org/browse/JDK-8320165 ?

May 14 '24 12:05 rkennke

Thank you for the link; I have made comments on that ticket.

May 14 '24 20:05 albertnetymk

Thanks for review.

/integrate

May 23 '24 07:05 albertnetymk

Going to push as commit 94af3c23ea09ef2869cdc666d8170a655a0b3602. Since your change was applied there have been 29 commits pushed to the master branch:

1e5a2780d9cc8e73ce65bdccb98c1808aadd0784: 8332676: Remove unused BarrierSetAssembler::incr_allocated_bytes
c2180d141ccca0e396ee9a0cd3044c4428b963d5: 8315767: InetAddress: constructing objects from BSD literal addresses
2a11e0da026066191e4d4f30b9daca986c484630: 8332743: Update comment related to JDK-8320522
6829d9ac67fb131462d3ef1c4bdfaa07df5d6be6: 8332122: [nmt] Totals for malloc should show total peak
9d332e6591334a71335da65a4dd7b2ed0482b6cb: 8307193: Several Swing jtreg tests use class.forName on L&F classes
98f6a80852383dcbdad7292b7d269a8547d54d45: 8332490: JMH org.openjdk.bench.java.util.zip.InflaterInputStreams.inflaterInputStreamRead OOM
3d4185a9ce482cc655a4c67f39cb2682b02ae4fe: 8332739: Problemlist compiler/codecache/CheckLargePages until JDK-8332654 is fixed
c4557a7b0db5b55585b4caa7cdec81e1c1093cbc: 8332463: Byte conditional pattern case element dominates short constant case element
d59c12fe1041a1f61f68408241a9aa4d96ac4fd2: 8329718: Incorrect @since tags in elements in jdk.compiler and java.compiler
b4d14540851d792b5366a3723abcea1264a5737c: 8332740: [BACKOUT] JDK-8331081 'internal proprietary API' diagnostics if --system is configured to an earlier JDK version
... and 19 more: https://git.openjdk.org/jdk/compare/9bfae8891e6efa58c557bd6dac61de111a16f71e...master

Your commit was automatically rebased without conflicts.

May 23 '24 07:05 openjdk[bot]

@albertnetymk Pushed as commit 94af3c23ea09ef2869cdc666d8170a655a0b3602.

:bulb: You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

May 23 '24 07:05 openjdk[bot]

jdk jdk copied to clipboard

8329203: Parallel: Investigate Mark-Compact for Full GC to decrease memory usage

Progress

Issue

Reviewers

Reviewing

Webrev

Webrevs

jdk
jdk copied to clipboard