
8305896: Alternative full GC forwarding

Open rkennke opened this issue 2 years ago • 46 comments

Currently, the full-GC modes of the Serial, Shenandoah and G1 GCs forward objects by overwriting the object header with the new object location. Unfortunately, this would not work with compact object headers (JDK-8294992), because the crucial class information is also stored in the header, and we could no longer iterate over objects until the headers are restored. Also, the preserved-headers tables would grow quite large.

I propose to use an alternative algorithm for full-GC (sliding-GC) forwarding that uses a special encoding so that the forwarding information fits into the lowest 32 bits of the header.

It exploits the insight that, with sliding GCs, objects from one region will only ever be forwarded to one of two possible target regions. For this to work, we need to divide the heap into equal-sized regions. This is already the case for Shenandoah and G1, and can easily be overlaid onto Serial GC by treating the whole heap as a single region (if it fits) or by using SpaceAlignment-sized virtual regions.

We also build and maintain a table with N entries, where N is the number of regions. Each entry holds two addresses: the start addresses of the two possible target regions for that source region.

With this, forwarding information would be encoded like this (a small illustrative sketch follows the list):

  • Bits 0 and 1: same as before, we set them to '11' to indicate that the object is forwarded.
  • Bit 2: used for 'fallback' forwarding (see below).
  • Bit 3: selects target region 0 or 1; the corresponding base address is looked up in the table (see above).
  • Bits 4..31: the offset, in heap words, from the target region's base address.
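
To make this concrete, here is a minimal sketch of how such an encoding could look, assuming 64-bit heap words; the names (TargetBases, encode_forwarding, decode_forwarding) are made up for illustration and are not the actual HotSpot code:

// Illustrative sketch of the bit layout described above; not the actual patch.
#include <cassert>
#include <cstdint>

static const uintptr_t FORWARDED_MASK = 0x3;                // bits 0-1: '11' marks a forwarded object
static const uintptr_t FALLBACK_BIT   = uintptr_t(1) << 2;  // bit 2: forwardee lives in the fallback table
static const uintptr_t ALT_REGION_BIT = uintptr_t(1) << 3;  // bit 3: selects target region 0 or 1
static const int       OFFSET_SHIFT   = 4;                  // bits 4..31: word offset from the target base

struct TargetBases {
  uintptr_t base[2];  // start addresses of the two possible target regions of a source region
};

// 'select' (0 or 1) is known to the compaction code when it claims a target region.
uintptr_t encode_forwarding(const TargetBases& bases, int select, uintptr_t to_addr) {
  uintptr_t offset_words = (to_addr - bases.base[select]) >> 3;   // 8-byte heap words
  assert(offset_words < (uintptr_t(1) << 28) && "offset must fit into 28 bits");
  return (offset_words << OFFSET_SHIFT)
       | (select != 0 ? ALT_REGION_BIT : uintptr_t(0))
       | FORWARDED_MASK;
}

uintptr_t decode_forwarding(const TargetBases& bases, uintptr_t header) {
  assert((header & FORWARDED_MASK) == FORWARDED_MASK && "object must be forwarded");
  assert((header & FALLBACK_BIT) == 0 && "fallback forwardees are looked up in a hash table instead");
  int select = (header & ALT_REGION_BIT) != 0 ? 1 : 0;
  uintptr_t offset_words = (header >> OFFSET_SHIFT) & ((uintptr_t(1) << 28) - 1);
  return bases.base[select] + (offset_words << 3);
}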

This works well for all sliding GCs in Serial, G1 and Shenandoah. The exception is G1's 'serial compaction' mode, which acts as a last-ditch effort to squeeze more space out of the heap by re-forwarding the tails of the compaction chains. Unfortunately, this breaks the assumption of the sliding-forwarding table. When that happens, we initialize a fallback table, which is a simple open hash table, and set bit 2 in the forwarding word to indicate that the forwardee must be looked up in the fallback table.

All the table accesses can be done unsynchronized because:

  • Serial GC is single-threaded anyway
  • In G1 and Shenandoah, GC worker threads divide up the work such that each worker processes disjoint sets of regions.
  • G1 serial compaction is single-threaded

The change introduces a new (experimental) flag -XX:[+|-]UseAltGCForwarding. This flag is not really intended to be used by end-users. Instead, I intend to enable it programmatically with compact object headers once they arrive (i.e. -XX:+UseCompactObjectHeaders would turn on -XX:+UseAltGCForwarding); the flag is also useful for testing purposes. Once compact object headers become the default and only implementation, the flag and the old implementation could be removed. JDK-8305898 would also use the same flag to enable an alternative self-forwarding approach (likewise in support of compact object headers).

The change also adds a utility class GCForwarding which calls the old or new implementation based on the flag. I think it would also be used by the self-forwarding change to be proposed soon (and separately).

I also experimented with a different forwarding approach that would use per-region hash tables, but shelved it for now because performance was significantly worse than with the sliding-forwarding encoding. It will become useful later when we want to do 32-bit compact object headers, because then the sliding encoding will no longer be sufficient to hold forwarding pointers in the header.

Testing:

  • [x] hotspot_gc -UseAltGCForwarding
  • [x] hotspot_gc +UseAltGCForwarding
  • [x] tier1 -UseAltGCForwarding
  • [x] tier1 +UseAltGCForwarding
  • [x] tier2 -UseAltGCForwarding
  • [x] tier2 +UseAltGCForwarding

Progress

  • [x] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issue

  • JDK-8305896: Alternative full GC forwarding (Enhancement - P4)

Reviewers

Contributors

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/13582/head:pull/13582
$ git checkout pull/13582

Update a local copy of the PR:
$ git checkout pull/13582
$ git pull https://git.openjdk.org/jdk.git pull/13582/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 13582

View PR using the GUI difftool:
$ git pr show -t 13582

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/13582.diff

Webrev

Link to Webrev Comment

rkennke avatar Apr 21 '23 15:04 rkennke

:wave: Welcome back rkennke! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Apr 21 '23 15:04 bridgekeeper[bot]

@rkennke this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8305896
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

openjdk[bot] avatar Apr 21 '23 15:04 openjdk[bot]

@rkennke The following labels will be automatically applied to this pull request:

  • hotspot-gc
  • shenandoah

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Apr 21 '23 15:04 openjdk[bot]

Webrevs

mlbridge[bot] avatar Apr 27 '23 12:04 mlbridge[bot]

Nice work @rkennke. Do you have any data on the RSS impact of using the alt forwarding table for some programs? Naturally, as the main motivation of the change is to allow a memory optimization, we shouldn't be too wasteful with memory to enable said memory optimization.

fisk avatar Apr 28 '23 13:04 fisk

Nice work @rkennke. Do you have any data on the RSS impact of using the alt forwarding table for some programs? Naturally, as the main motivation of the change is to allow a memory optimization, we shouldn't be too wasteful with memory to enable said memory optimization.

It only allocates a relatively small side-table of number-of-regions * 2 words. For G1 and Shenandoah, it uses the 'natural' GC regions. A typical number of regions would be 2048, so we'd get a side-table of 2048 * 2 * 8 bytes = 32 KB. If we assume a region size of 2 MB (which is at the lower end of the spectrum), then that heap would be 4 GB. That's an overhead of roughly 0.0008%, if I counted correctly. Ultimately it depends on the workload, but I believe it is an acceptable overhead.

Also, the 'memory optimization' (compact object headers) would be beneficial in all situations, i.e. during normal operation. This GC-forwarding change is only relevant for full GCs, which are exceptional situations. At least in Shenandoah we try very hard to avoid them altogether; for Serial and G1 they tend to happen once in a while, but IMO not often enough that a ~0.0008% overhead would be a concern.

rkennke avatar Apr 28 '23 14:04 rkennke

/contributor add @shipilev

rkennke avatar Apr 28 '23 14:04 rkennke

@rkennke Contributor Aleksey Shipilev <[email protected]> successfully added.

openjdk[bot] avatar Apr 28 '23 14:04 openjdk[bot]

Nice work @rkennke. Do you have any data on the RSS impact of using the alt forwarding table for some programs? Naturally, as the main motivation of the change is to allow a memory optimization, we shouldn't be too wasteful with memory to enable said memory optimization.

It only allocates a relatively small side-table of number-of-regions * 2 words. For G1 and Shenandoah, it uses the 'natural' GC regions. A typical number of regions would be 2048, so we'd get a side-table of 2048 * 2 * 8 bytes = 32 KB. If we assume a region size of 2 MB (which is at the lower end of the spectrum), then that heap would be 4 GB. That's an overhead of roughly 0.0008%, if I counted correctly. Ultimately it depends on the workload, but I believe it is an acceptable overhead.

Also, the 'memory optimization' (compact object headers) would be beneficial in all situations, i.e. during normal operation. This GC-forwarding change is only relevant for full GCs, which are exceptional situations. At least in Shenandoah we try very hard to avoid them altogether; for Serial and G1 they tend to happen once in a while, but IMO not often enough that a ~0.0008% overhead would be a concern.

Sounds good.

fisk avatar Apr 28 '23 14:04 fisk

Hi Roman,

Small general concern: the last-ditch fallback table may be impractical cost-wise. How large is it expected to grow? You pay 24+x bytes (~48 on glibc with internal overhead) per forwarded oop.

A very easy first-step mitigation: let the table house the first n (1000-10000) nodes in an inline member array. Allocate nodes from there and only allocate spill-overs from the C-heap. Allocation would be a lot faster and cheaper memory-wise, and it's just a few lines of code.

I did some experiments with the only jtreg test that seems to exercise G1 serial compaction (and thus the fallback table): gc/stress/TestMultiThreadStressRSet.java. With a fallback-table size of 128 I'd typically end up with several dozen excess nodes, sometimes more than the base table size. Up to a table size of 512 this reduces significantly, but there are still typically one to several dozen extra nodes. When I switched to a table size of 1024, the extra-node count drops below one dozen in most cases. I'll leave the table size at this value until we find a good reason to extend it, ok?
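
A minimal sketch of how the suggested inline node pool could be combined with a 1024-bucket base table; the names and sizes are illustrative and this is not the actual patch (freeing spilled-over nodes is also omitted):

#include <cstdint>
#include <cstdlib>
#include <cstring>

class FallbackTable {
  struct Node { uintptr_t from; uintptr_t to; Node* next; };

  static const size_t BUCKETS      = 1024;  // bucket count
  static const size_t INLINE_NODES = 1024;  // inline pool size, per the suggestion above

  Node*  _buckets[BUCKETS];
  Node   _inline_pool[INLINE_NODES];
  size_t _inline_used;

  Node* new_node() {
    if (_inline_used < INLINE_NODES) {
      return &_inline_pool[_inline_used++];            // cheap: no malloc, no per-node overhead
    }
    return static_cast<Node*>(::malloc(sizeof(Node))); // rare spill-over to the C-heap
  }

  static size_t hash(uintptr_t from) { return (from >> 3) & (BUCKETS - 1); }

public:
  FallbackTable() : _inline_used(0) { ::memset(_buckets, 0, sizeof(_buckets)); }

  void forward_to(uintptr_t from, uintptr_t to) {
    Node* n = new_node();
    n->from = from;
    n->to   = to;
    size_t idx = hash(from);
    n->next = _buckets[idx];
    _buckets[idx] = n;
  }

  uintptr_t forwardee(uintptr_t from) const {
    for (const Node* n = _buckets[hash(from)]; n != nullptr; n = n->next) {
      if (n->from == from) return n->to;
    }
    return 0;  // not forwarded via the fallback table
  }
};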

rkennke avatar May 02 '23 16:05 rkennke

@rkennke This change is no longer ready for integration - check the PR body for details.

openjdk[bot] avatar May 03 '23 11:05 openjdk[bot]

Performance: the worst case I can come up with is a Serial full GC that moves an entire heap full of the smallest possible objects, like this:

public class Retain {
    // Number of small objects kept live across GCs, and number of full GCs to run.
    static final int RETAINED = Integer.getInteger("retained", 10_000_000);
    static final int GCS      = Integer.getInteger("gcs", 100);
    static Object[] OBJECTS = new Object[RETAINED];

    public static void main(String... args) {
        for (int t = 0; t < GCS; t++) {
            // Re-populate the array so every full GC has to forward RETAINED live objects.
            for (int c = 0; c < RETAINED; c++) {
                OBJECTS[c] = new Object();
            }
            System.gc();
        }
    }
}

On my c6n.8xlarge instance, with java -Xmx1g -Xlog:gc -XX:+UseSerialGC Retain.java, I see:

 baseline: 364 +- 5 ms
 patched, -AltGCForwarding: 385 +- 3 ms [+6%]
 patched, +AltGCForwarding: 445 +- 5 ms [+22%]

There are regressions even with -AltGCForwarding, and judging from the profiles and some point experiments, those are caused by the AltGCForwarding flag checks on every forward_to and forwardee call, split evenly between those two paths. But given that the very targeted workload above intentionally runs back-to-back full GCs, this regression looks okay. (I think the only way to dodge it would be to template the relevant GC code and dispatch to it once per GC phase, rather than per oop, which would be very intrusive and serve no practical need, IMO.)

The regression with +AltGCForwarding looks impressive in comparison: it is "only" worth three flag checks or so. The code I am seeing in the profiles is already quite polished, so we are unlikely to squeeze more out of it without investing much more time.

I don't think any of this would show up in larger benchmarks running in the usual (young, mixed) GC modes. Indeed, I ran a few point experiments, and there seems to be no visible change.

shipilev avatar May 03 '23 14:05 shipilev

We may be able to squeeze out a bit more performance in the +AltGCForwarding case and eliminate one reference hop by inlining the bases table into the SlidingForwarding object. Let the table follow the object, and maybe reduce some member sizes (either of _num_regions or _region_size_word_shift can be 32-bit, for instance). At least when there are very few regions, that would let the whole table live on the same cache line as the SlidingForwarding object itself.

The simplest way to do that would be to add a fixed-size array to the object and use it as backing memory if the bases table fits into it, otherwise allocate the table dynamically.

Maybe not worth the work, up to you of course.
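
A compact sketch of the suggested layout, with made-up names and sizes, just to show the idea of the inline backing array and the smaller members; not the actual patch:

#include <cstdint>
#include <cstdlib>
#include <cstring>

class SlidingForwardingSketch {
  static const size_t INLINE_REGIONS = 32;       // enough for heaps with few regions

  uint32_t   _num_regions;                       // 32-bit members, as suggested
  uint32_t   _region_size_words_shift;
  uintptr_t* _bases;                             // points at _inline_bases or a heap block
  uintptr_t  _inline_bases[2 * INLINE_REGIONS];  // lives right next to the other members

public:
  SlidingForwardingSketch(uint32_t num_regions, uint32_t region_size_words_shift)
    : _num_regions(num_regions),
      _region_size_words_shift(region_size_words_shift) {
    if (num_regions <= INLINE_REGIONS) {
      ::memset(_inline_bases, 0, sizeof(_inline_bases));
      _bases = _inline_bases;                    // no extra allocation, no extra indirection target
    } else {
      _bases = static_cast<uintptr_t*>(::calloc(2 * size_t(num_regions), sizeof(uintptr_t)));
    }
  }

  ~SlidingForwardingSketch() {
    if (_bases != _inline_bases) ::free(_bases);
  }

  // Two base addresses (target region 0 and 1) per source region.
  uintptr_t* bases_for(size_t region_index) { return &_bases[2 * region_index]; }
};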

tstuefe avatar May 03 '23 15:05 tstuefe

LGTM. I'm okay with this now; my remaining comments are suggestions, up to you to take them or not.

Thanks!

Tests are missing. A simple way would be to run a selection of our standard GC tests with +AltGCForwarding. This is especially important if you follow Thomas' advice and make AltGCForwarding a develop switch.

I ran hotspot_gc with UseAltGCForwarding turned on. I'm not sure there is an easy way to make this a test task. I could perhaps add a few run configurations to tests that are useful; for example, gc/stress/TestMultiThreadStressRSet.java tended to exercise both the sliding forwarding and the fallback forwarding. But wouldn't the develop-only switch complicate this? It would mean we could only run such tests in debug builds.

The only other thing that occurred to me is that you could probably change the initialization: don't require the caller to specify it, but calculate it yourself such that the 28-bit offset is maximally used. That would save some memory, since the bases table can be smaller. Again, up to you.

Yeah, maybe. SpaceAlignment should be set by all GCs to a reasonable region size, so we could probably just pick that up. OTOH, we need a little bit of cooperation from the GC here: the whole sliding-forwarding algorithm relies on the fact that GC workers divide up their work based on regions and are essentially single-threaded within their work queues. I'm a bit worried about touching this stuff at this point and causing another round of reviews.

rkennke avatar May 04 '23 14:05 rkennke

@tschatzl are you good with this PR now?

rkennke avatar May 05 '23 08:05 rkennke

I have been working on aggressively strength-reducing operations and using biased arrays (yes, you can; not sure why not?). Performance also seems better if you separate `_bases_table` into two logical tables. On an Ice Lake machine, this reduces the overhead of +UseAltGCForwarding over baseline from a bit more than 10% to ~7.8% on the benchmark you gave. The same code also shows improvements on a Zen3 machine (from ~9% to 5.3%, but the result is noisy and the machine isn't "clean").

Let me clean up the code a bit and provide it to you on Monday so that you can test it too, if you are interested.

Actually, the current code is at https://github.com/openjdk/jdk/compare/master...tschatzl:jdk:alt-fullgc-forwarding?expand=1 if you want to play with it; the strength-reduction commit does the strength reduction and biasing while keeping the current code intact, and the other two commits split the bases table into two separate tables and clean up a bit.

Hopefully I wasn't just dreaming when testing it (fwiw, I can only kind of guarantee that it runs that Retain test; it is not really tested otherwise). But it tended to crash if I didn't get things right for that test.
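
For illustration, a small sketch of the biased-array idiom with the bases table split into two logical tables, as described above; the names are made up and this shows the general idiom rather than the branch's actual code:

#include <cstdint>
#include <cstdlib>

struct BiasedBases {
  // Two logical tables, one per possible target slot, as in the split-_bases_table variant.
  uintptr_t* _biased_base[2];
  int        _region_size_bytes_shift;

  void initialize(uintptr_t heap_start, size_t num_regions, int region_size_bytes_shift) {
    _region_size_bytes_shift = region_size_bytes_shift;
    for (int t = 0; t < 2; t++) {
      uintptr_t* raw = static_cast<uintptr_t*>(::calloc(num_regions, sizeof(uintptr_t)));
      // Bias the pointer so that raw[0] corresponds to the first heap region when indexed
      // with (addr >> shift); freeing the raw (unbiased) pointers is omitted in this sketch.
      _biased_base[t] = raw - (heap_start >> region_size_bytes_shift);
    }
  }

  // Strength-reduced lookup: one shift and one indexed load, no subtraction of the heap base.
  uintptr_t base_for(int target, uintptr_t obj_addr) const {
    return _biased_base[target][obj_addr >> _region_size_bytes_shift];
  }
};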

tschatzl avatar May 05 '23 16:05 tschatzl

I have been working on aggressively strength-reducing operations and using biased arrays (yes, you can; not sure why not?). Performance also seems better if you separate `_bases_table` into two logical tables. On an Ice Lake machine, this reduces the overhead of +UseAltGCForwarding over baseline from a bit more than 10% to ~7.8% on the benchmark you gave. The same code also shows improvements on a Zen3 machine (from ~9% to 5.3%, but the result is noisy and the machine isn't "clean"). Let me clean up the code a bit and provide it to you on Monday so that you can test it too, if you are interested.

Actually, the current code is at https://github.com/openjdk/jdk/compare/master...tschatzl:jdk:alt-fullgc-forwarding?expand=1 if you want to play with it; the strength-reduction commit does the strength reduction and biasing while keeping the current code intact, and the other two commits split the bases table into two separate tables and clean up a bit.

Hopefully I wasn't just dreaming when testing it (fwiw, I can only kind of guarantee that it runs that Retain test; it is not really tested otherwise). But it tended to crash if I didn't get things right for that test.

Nice! Thank you for doing that! I'll give it a test as soon as I get to it (perhaps only on Monday). Would you be ok with me integrating it into the main PR once my testing looks good? Or do you have more things you want to do on that branch?

rkennke avatar May 05 '23 16:05 rkennke

The https://github.com/openjdk/jdk/compare/master...tschatzl:jdk:alt-fullgc-forwarding?expand=1 branch now contains the promised cleanup.

tschatzl avatar May 08 '23 10:05 tschatzl

The https://github.com/openjdk/jdk/compare/master...tschatzl:jdk:alt-fullgc-forwarding?expand=1 branch now contains the promised cleanup.

Thanks, Thomas! This looks useful. I've merged your branch into this PR.

rkennke avatar May 08 '23 10:05 rkennke

/contributor add @tschatzl

rkennke avatar May 08 '23 10:05 rkennke

⚠️ @rkennke This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request in integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).

openjdk[bot] avatar May 08 '23 10:05 openjdk[bot]

@rkennke Contributor Thomas Schatzl <[email protected]> successfully added.

openjdk[bot] avatar May 08 '23 10:05 openjdk[bot]

it can forward objects from one region to a maximum of two regions

Those two regions must be adjacent, right? Something like: _biased_bases[1][from_reg_idx] - _biased_bases[0][from_reg_idx] == _region_size_words.

If that's the case, I don't understand why the to-address is broken into two parts [offset][alternate region select]. Is it possible to double the size of the to-region so that "a maximum of two regions" becomes a single, large to-region? (I think this means _biased_bases can be shrunk to a 1D array.)

albertnetymk avatar May 08 '23 12:05 albertnetymk

it can forward objects from one region to a maximum of two regions

Those two regions must be adjacent, right? Something like: _biased_bases[1][from_reg_idx] - _biased_bases[0][from_reg_idx] == _region_size_words.

If that's the case, I don't understand why the to-address is broken into two parts [offset][alternate region select]. Is it possible to double the size of the to-region so that "a maximum of two regions" becomes a single, large to-region? (I think this means _biased_bases can be shrunk to a 1D array.)

No, the target regions don't have to be adjacent. Parallel full GCs divide up the work across worker threads, assigning from-regions to each worker, and each worker would then claim to-regions as they go and work with that. To-regions of different workers can interleave and are often not adjacent.

rkennke avatar May 08 '23 12:05 rkennke

Parallel full GCs divide up ...

Why is Parallel relevant here?

The description mentions only "the full-GC modes of Serial, Shenandoah and G1 GCs are", so I assumed this feature doesn't affect/depend on Parallel.

albertnetymk avatar May 08 '23 13:05 albertnetymk

Parallel full GCs divide up ...

Why is Parallel relevant here?

The description mentions only "the full-GC modes of Serial, Shenandoah and G1 GCs are", so I assumed this feature doesn't affect/depend on Parallel.

Sorry, I wasn't clear enough. I meant the full GCs of G1 and Shenandoah, which use multiple threads to do their work. Parallel GC is indeed not relevant. Also, Serial GC doesn't really compact everything into one contiguous stream, because it divides the heap into old, young-eden, young-s1 and young-s2 spaces.

And btw, even if that assumption were true, we could not do this. If we were to collapse each pair of adjacent regions into one, we would still end up with the same conclusion: for each source region, objects can be forwarded to one of two potential target regions. Say we divide the heap into 10 equal regions; region 0 would logically only compact into the bottom of region 0, but region 1 might compact into region 0 or 1. Depending on how full region 0 becomes, region 2 might compact into regions 0 and 1, or regions 1 and 2. And so on.

rkennke avatar May 08 '23 13:05 rkennke

I meant the full-GCs of G1 and Shenandoah, which use multiple threads to do their work.

I see; regions in the compaction-queue might not be contiguous in heap-space, so to-regions will not necessarily be adjacent.

Thank you for the clarification.

And btw, even if what you say were true, we could not do this.

I meant keeping the current from-region size and doubling only the to-region size. (But my original assumption was wrong, so this doesn't work.)

albertnetymk avatar May 08 '23 13:05 albertnetymk

@tschatzl or anybody else: any concerns remaining with this PR? If not, I would integrate it later today (if the GHA are green).

rkennke avatar May 09 '23 10:05 rkennke

@tschatzl @fisk @coleenp @shipilev I've pushed a change that specializes all affected full-GC loops to get the flag check out of the hot loops. This should bring performance on the -UseAltGCForwarding paths back to normal, and show minimal effect (if any) on the +UseAltGCForwarding path. It is a little uglier, though. If you prefer the clean and slightly slower version, let me know and I will back it out. Consider that once the Lilliput stuff is stable and default (after a few release cycles), all of the specialized stuff will disappear. I guess the alternative would be to not put it under a flag to begin with.
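
Roughly, the specialization works like the following sketch, with made-up names and not the actual patch: the flag is checked once per GC phase and dispatches to a loop templated on the forwarding mode, so the per-oop path carries no flag check.

#include <cstddef>

struct OopSketch;  // stand-in for an object; details irrelevant here

template <bool ALT_FWD>
static void adjust_pointers_loop(OopSketch** objs, size_t count) {
  for (size_t i = 0; i < count; i++) {
    OopSketch* obj = objs[i];
    (void)obj;  // forward_to()/forwardee() for 'obj' would use the sliding encoding when
                // ALT_FWD is true and the legacy header overwrite otherwise; the choice
                // is made at compile time, so the hot loop carries no runtime flag check.
  }
}

static void adjust_pointers_phase(OopSketch** objs, size_t count, bool use_alt_forwarding) {
  if (use_alt_forwarding) {
    adjust_pointers_loop<true>(objs, count);    // one check per GC phase...
  } else {
    adjust_pointers_loop<false>(objs, count);   // ...instead of one per oop
  }
}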

rkennke avatar May 22 '23 14:05 rkennke

Using the same Retain.java program that Aleksey posted earlier, I now get the following numbers:

 Baseline: 286.9 ms
 -AltGCForwarding: 286.3 ms (-0.2%)
 +AltGCForwarding: 309.1 ms (+7.7%)

rkennke avatar May 22 '23 19:05 rkennke