jdk icon indicating copy to clipboard operation
jdk copied to clipboard

8329203: Parallel: Investigate Mark-Compact for Full GC to decrease memory usage

Open albertnetymk opened this issue 9 months ago • 12 comments

Refactor Parallel full-gc to use the same algorithm (mark-compact) as Serial and G1 full-GC. This removes the obj-end bitmap. When GC threads are few, the old implementation can be more efficient because it requires fewer heap iterations. The new full-GC implementation, on the other hand, is more scalable because it introduces more phases (forward_to_new_addr and adjust_pointers) that can partition work effectively.

The diff is rather large, so reading the new code directly from invoke_no_policy is probably easier.

Test: tier1-6; some improvement in Dacapo-h2, CacheStresser, but no difference in specjbb2015, specjvm2008.


Progress

  • [x] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issue

  • JDK-8329203: Parallel: Investigate Mark-Compact for Full GC to decrease memory usage (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19101/head:pull/19101
$ git checkout pull/19101

Update a local copy of the PR:
$ git checkout pull/19101
$ git pull https://git.openjdk.org/jdk.git pull/19101/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 19101

View PR using the GUI difftool:
$ git pr show -t 19101

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19101.diff

Webrev

Link to Webrev Comment

albertnetymk avatar May 06 '24 10:05 albertnetymk

:wave: Welcome back ayang! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar May 06 '24 10:05 bridgekeeper[bot]

@albertnetymk This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8329203: Parallel: Investigate Mark-Compact for Full GC to decrease memory usage

Reviewed-by: rkennke, gli

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 2 new commits pushed to the master branch:

  • ae9ad862ee54e119553efec919f1061dca36b954: 8331934: [s390x] Add support for primitive array C1 clone intrinsic
  • 3479b46c5bea3afd92b6ab4acd2fe7f274df38aa: 8332595: Serial: Remove unused TenuredGeneration::should_collect

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch. As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk[bot] avatar May 06 '24 10:05 openjdk[bot]

@albertnetymk The following label will be automatically applied to this pull request:

  • hotspot-gc

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar May 06 '24 10:05 openjdk[bot]

Thank you, that's a great change. I actually intended to do something similar soon, because the existing Parallel Full GC would not be compatible with some stuff that I need for Lilliput 2 (namely, the Compact Identity Hashcode).

Before I go into reviewing, could you describe a bit deeper how the implementation works? In particular, G1 (and Shenandoah) parallelize work by assigning regions to GC workers. (It does not matter for Serial, obviously.) Parallel GC has no such conceps of regions. How does your implementation parallelize the work? (My idea was to divide into regions for the purpose of the full-GC and deal with overlapping objects somehow, maybe by using a block-offset-table. But I haven't spent too much thought on it, so far.)

rkennke avatar May 06 '24 11:05 rkennke

Parallel GC has no such conceps of regions.

Parallel full-gc does use "regions", ParallelCompactData::RegionData.

My idea was to divide into regions for the purpose of the full-GC and deal with overlapping objects somehow, maybe by using a block-offset-table. But I haven't spent too much thought on it, so far.

That's how existing full-gc works (e.g. partial_obj_addr for overlapping objs) and the new implementation doesn't touch/change that part. The most significant change is doing new-addr-calculation and remapping before compaction; hence, these two new phases.

albertnetymk avatar May 06 '24 11:05 albertnetymk

Ok, I see. That sounds reasonable.

Have you also run the micros test/micro/org/openjdk/bench/vm/gc/systemgc/ on it?

rkennke avatar May 06 '24 13:05 rkennke

Ok then. I tried to fit Lilliput's SlidingForwarding and works fine. I also tried to fit Lilliput2's Compact Identity Hashcode and this does not yet quite work. The problem is that we collect liveness data per region during marking, and use that as a basis to calculate the destination address of a region. However, with compact i-hash, the size of a copied object might be one word larger, which means that this calculation would be off. In essence, we don't know during marking, which size an object may have when copied. We only know that once we calculate the forwarding addresses, and it depends on the hash-bits and whether or not an object actually moves or not. Can you think of a way to not rely on marking liveness info for calculating the destination address for each region? Check out if you like: https://github.com/rkennke/lilliput/tree/lilliput-2-prototype

rkennke avatar May 08 '24 10:05 rkennke

During forward_to_new_addr(), we give each worker a contiguous set of regions to work on.

Ok then. I tried to fit Lilliput's SlidingForwarding and works fine. I also tried to fit Lilliput2's Compact Identity Hashcode and this does not yet quite work. The problem is that we collect liveness data per region during marking, and use that as a basis to calculate the destination address of a region. However, with compact i-hash, the size of a copied object might be one word larger, which means that this calculation would be off. In essence, we don't know during marking, which size an object may have when copied. We only know that once we calculate the forwarding addresses, and it depends on the hash-bits and whether or not an object actually moves or not. Can you think of a way to not rely on marking liveness info for calculating the destination address for each region? Check out if you like: https://github.com/rkennke/lilliput/tree/lilliput-2-prototype

I guess a naive way to do that would be to use a global bump-ptr during forwarding phase, that each worker bumps up atomically, instead of relying on the pre-summarized structure. This would quickly become a scalability bottleneck, of course. So, instead of doing that, workers could claim region-sized chunks (or larger, when larger objects need to be forwarded) atomically, and bump up inside those regions thread-locally. The worker would also have to record (in the region-data) the new-top for the region. That waste-gap has to be filled later, during or after compaction. When a worker claims a larger-than-region chunk, then the next claimed (smaller-than-region) chunk would be smaller, so that its end aligns with region-boundary - this makes keeping track of the gaps easier.

I think using this approach, we could eliminate a whole lot of other stuff, the summarization and keeping track of liveness during marking might all no longer be needed. It might also be more efficient because of this. OTOH, this leads to some waste at end of regions, which could probably be minimized using a serial phase similar to what G1 does.

If you agree this is a sane approach, I'll try that for my Lilliput2 prototype - I don't think there is a need to have this in upstream just yet. Unless you want to have it. ;-)

rkennke avatar May 08 '24 12:05 rkennke

If you agree this is a sane approach

I think it's worth exploring. Should not be part of this PR though. Feel free to try that for Lilliput2.

albertnetymk avatar May 08 '24 12:05 albertnetymk

Can you check if this fixes https://bugs.openjdk.org/browse/JDK-8320165 ?

rkennke avatar May 14 '24 12:05 rkennke

Thank you for the link; I have made comments on that ticket.

albertnetymk avatar May 14 '24 20:05 albertnetymk

Thanks for review.

/integrate

albertnetymk avatar May 23 '24 07:05 albertnetymk

Going to push as commit 94af3c23ea09ef2869cdc666d8170a655a0b3602. Since your change was applied there have been 29 commits pushed to the master branch:

  • 1e5a2780d9cc8e73ce65bdccb98c1808aadd0784: 8332676: Remove unused BarrierSetAssembler::incr_allocated_bytes
  • c2180d141ccca0e396ee9a0cd3044c4428b963d5: 8315767: InetAddress: constructing objects from BSD literal addresses
  • 2a11e0da026066191e4d4f30b9daca986c484630: 8332743: Update comment related to JDK-8320522
  • 6829d9ac67fb131462d3ef1c4bdfaa07df5d6be6: 8332122: [nmt] Totals for malloc should show total peak
  • 9d332e6591334a71335da65a4dd7b2ed0482b6cb: 8307193: Several Swing jtreg tests use class.forName on L&F classes
  • 98f6a80852383dcbdad7292b7d269a8547d54d45: 8332490: JMH org.openjdk.bench.java.util.zip.InflaterInputStreams.inflaterInputStreamRead OOM
  • 3d4185a9ce482cc655a4c67f39cb2682b02ae4fe: 8332739: Problemlist compiler/codecache/CheckLargePages until JDK-8332654 is fixed
  • c4557a7b0db5b55585b4caa7cdec81e1c1093cbc: 8332463: Byte conditional pattern case element dominates short constant case element
  • d59c12fe1041a1f61f68408241a9aa4d96ac4fd2: 8329718: Incorrect @since tags in elements in jdk.compiler and java.compiler
  • b4d14540851d792b5366a3723abcea1264a5737c: 8332740: [BACKOUT] JDK-8331081 'internal proprietary API' diagnostics if --system is configured to an earlier JDK version
  • ... and 19 more: https://git.openjdk.org/jdk/compare/9bfae8891e6efa58c557bd6dac61de111a16f71e...master

Your commit was automatically rebased without conflicts.

openjdk[bot] avatar May 23 '24 07:05 openjdk[bot]

@albertnetymk Pushed as commit 94af3c23ea09ef2869cdc666d8170a655a0b3602.

:bulb: You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

openjdk[bot] avatar May 23 '24 07:05 openjdk[bot]