jdk icon indicating copy to clipboard operation
jdk copied to clipboard

8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier

Open tschatzl opened this issue 9 months ago • 41 comments

Hi all,

please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier.

The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25.

Current situation

With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier.

The main reason for the current barrier is how g1 implements concurrent refinement:

  • g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations.
  • For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads,
  • Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible.

These tasks require the current barrier to look as follows for an assignment x.a = y in pseudo code:

// Filtering
if (region(@x.a) == region(y)) goto done; // same region check
if (y == null) goto done;     // null value check
if (card(@x.a) == young_card) goto done;  // write to young gen check
StoreLoad;                // synchronize
if (card(@x.a) == dirty_card) goto done;

*card(@x.a) = dirty

// Card tracking
enqueue(card-address(@x.a)) into thread-local-dcq;
if (thread-local-dcq is not full) goto done;

call runtime to move thread-local-dcq into dcqs

done:

Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc.

The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining.

There are several papers showing that this barrier alone can decrease throughput by 10-20% (Yang12), which is corroborated by some benchmarks (see links).

The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a second card table ("refinement table"). The second card table also replaces the dirty card queue.

In that scheme the fine-grained synchronization is unnecessary because mutator and refinement threads always write to different memory areas (and no concurrent write where an update can be lost can occur). This removes the necessity for synchronization for every reference write. Also no card enqueuing is required any more.

Only the filters and the card mark remain.

How this works

In the beginning both the card table and the refinement table are completely unmarked (contain "clean" cards). The mutator dirties the card table, until G1 heuristics think that a significant enough amount of cards were dirtied based on what is allocated for scanning them during the garbage collection.

At that point, the card table and the refinement table are exchanged "atomically" using handshakes. The mutator keeps dirtying the (the previous, clean refinement table which is now the) card table, while the refinement threads look for and refine dirty cards on the refinement table as before.

Refinement of cards is very similar to before: if an interesting reference in a dirty card has been found, G1 records it in appropriate remembered sets. In this implementation there is an exception for references to the current collection set (typically young gen) - the refinement threads redirty that card on the card table with a special to-collection-set value.

This is valid because races with the mutator for that write do not matter - the entire card will eventually be rescanned anyway, regardless of whether it ends up as dirty or to-collection-set. The advantage of marking to-collection-set cards specially is that the next time the card tables are swapped, the refinement threads will not re-refine them on the assumption that that reference to the collection set will not change. This decreases refinement work substantially.

If refinement gets interrupted by GC, the refinement table will be merged with the card table before card scanning, which works as before.

New barrier pseudo-code for an assignment x.a = y:

// Filtering
if (region(@x.a) == region(y)) goto done; // same region check
if (y == null) goto done;     // null value check
if (card(@x.a) != clean_card) goto done;  // skip already non-clean cards
*card(@x.a) = dirty

This is basically the Serial/Parallel GC barrier with additional filters to keep the number of dirty cards as little as possible.

A few more comments about the barrier:

  • the barrier now loads the card table base offset from a thread local instead of inlining it. This is necessary for this mechanism to work as the card table to dirty changes over time, and may even be faster on some architectures (code size), and some architectures already do.
  • all existing pre-filters were kept. Benchmarks showed some significant regressions wrt to pause times and even throughput compared to G1 in master. Using the Parallel GC barrier (just the dirty card write) would be possible, and further investigation on stripping parts will be made as follow-up.
  • the final check tests for non-clean cards to avoid overwriting existing cards, in particular the "to-collection set" cards described above.

Current G1 marks the cards corresponding to young gen regions as all "young" so that the original barrier could potentially avoid the StoreLoad. This implementation removes this facility (which might be re-introduced later), but measurements showed that pre-dirtying the young generation region's cards as "dirty" (g1 does not need to use an extra "young" value) did not yield any measurable performance difference.

Refinement process

The goal of the refinement (threads) is to make sure that the number of cards to scan in the garbage collection is below a particular threshold.

The prototype changes the refinement threads into a single control thread and a set of (refinement) worker threads. Differently to the previous implementation, the control thread does not do any refinement, but only executes the heuristics to start a calculated amount of worker threads and tracking refinement progress.

The refinement trigger is based on current known number of pending (i.e. dirty) cards on the card table and a pending card generation rate, fairly similarly to the previous algorithm. After the refinement control thread determines that it is time to do refinement, it starts the following sequence:

  1. Swap the card table. This consists of several steps:
    1. Swap the global card table - the global card table pointer is swapped; newly created threads and runtime calls will eventually use the new values, at the latest after the next two steps.
    2. Update the pointers in all JavaThread's TLS storage to the new card table pointer using a handshake operation
    3. Update the pointers in the GC thread's TLS storage to the new card table pointer using the SuspendibleThreadSet mechanism
  2. Snapshot the heap - determine the extent of work needed for all regions where the refinement threads need to do some work on the refinement table (the previous card table). The snapshot stores the work progress for each region so that work can be interrupted and continued at any time. This work either consists of refinement of the particular card (old generation regions) or clearing the cards (next collection set/young generation regions).
  3. Sweep the refinement table by activating the refinement worker threads. The threads refine dirty cards using the heap snapshot where worker threads claim parts of regions to process.
    • Cards with references to the young generation are not added to the young generation's card based remembered set. Instead these cards are marked as to-collection-set in the card table and any remaining refinement of that card skipped.
    • If refinement encounters a card that is already marked as to-collection-set it is not refined and re-marked as to-collection-set on the card table .
    • During refinement, the refinement table is also cleared (in bulk for collection set regions as they do not need any refinement, and in other regions as they are refined for the non-clean cards).
    • Dirty cards within unparsable heap areas are forwarded to/redirtied on the card table as is.
  4. Completion work, mostly statistics.

If the work is interrupted by a non-garbage collection synchronization point, work is suspended temporarily and resumed later using the heap snapshot.

After the refinement process the refinement table is all-clean again and ready to be swapped again.

Garbage collection pause changes

Since a garbage collection (young or full gc) pause may occur at any point during the refinement process, the garbage collection needs some compensating work for the not yet swept parts of the refinement table.

Note that this situation is very rare, and the heuristics try to avoid that, so in most cases nothing needs to be done as the refinement table is all clean.

If this happens, young collections add a new phase called Merge Refinement Table in the garbage collection pause right before the Merge Heap Roots phase. This compensating phase does the following:

  1. (Optional) Snapshot the heap if not done yet (if the process has been interrupted between state 1 and 3 of the refinement process)
  2. Merge the refinement table into the card table - in this step the dirty cards of interesting regions are
  3. Completion work (statistics)

If a full collection interrupts concurrent refinement, the refinement table is simply cleared and all dirty cards thrown away.

A garbage collection generates new cards (e.g. references from promoted objects into the young generation) on the refinement table. This acts similarly to the extra DCQS used to record these interesting references/cards and redirty the card table using them in the previous implementation. G1 swaps the card tables at the end of the collection to keep the post-condition of the refinement table being all clean (and any to-be-refined cards on the card table) at the end of garbage collection.

Performance metrics

Following is an overview of the changes in behavior. Some numbers are provided in the CR in the first comment.

Native memory usage

The refinement table takes an additional 0.2% of the Java heap size of native memory compared to JDK 21 and above (in JDK 21 we removed one card table sized data structure, so this is a non-issue when updating from before).

Some of that additional memory usage is automatically reclaimed by removing the dirty card queues. Additional memory is reclaimed by managing the cards containing to-collection-set references on the card table by dropping the explicit remembered sets for young generation completely and any remembered set entries which would otherwise be duplicated into the other region's remembered sets.

In some applications/benchmarks these gains completely offset the additional card table, however most of the time this is not the case, particularly for throughput applications currently. It is possible to allocate the refinement table lazily, which means that since these applications often do not need any concurrent refinement, there is no overhead at all but actually a net reduction of native memory usage. This is not implemented in this prototype.

Latency ("Pause times")

Not affected or slightly better. Pause times decrease due to a shorter "Merge remembered sets" phase due to no work required for the remembered sets for the young generation - they are always already on the card table!

However merging of the refinement table into the card table is extremely fast and is always faster than merging remembered sets for the young gen in my measurements. Since this work is linearly scanning some memory, this is embarassingly parallel too.

The cards created during garbage collection do not need to be redirtied, so that phase has also been removed.

The card table swap is based on predictions for mutator card dirtying rate and refinement rate as before, and the policy is actually fairly similar to before. It is still rather aggressive, but in most cases takes less cpu resources than the one before, mostly because refining takes less cpu time. Many applications do not do any refinement at all like before. More investigation could be done to improve this in the future.

Throughput

This change always increases throughput in my measurements, depending on benchmark/application it may not actually show up in scores though.

Due to the pre-barrier and the additional filters in the barrier G1 is still slower than Parallel on raw throughput benchmarks, but is typically somewhere half-way to Parallel GC or closer.

Code Size

Code size measurements on DaCapo benchmarks by @robcasloz showed that this change decreases code size by around 5%.

Platform support

Since the post write barrier changed, additional work for some platforms is required to allow this change to proceed. At this time all work for all platforms is done, but needs testing

  • GraalVM (contributed by the GraalVM team)
  • S390 (contributed by A. Kumar from IBM)
  • PPC (contributed by M. Doerr, from SAP)
  • ARM (should work, HelloWorld compiles and runs)
  • RISCV (should work, HelloWorld compiles and runs)
  • x86 (should work, build/HelloWorld compiles and runs)

None of the above mentioned platforms implement the barrier method to write cards for a reference array (aarch64 and x64 are fully implemented), they call the runtime as before. I believe it is doable fairly easily now with this simplified barrier for some extra performance, but not necessary.

Alternatives

The JEP text extensively discusses alternatives.

Reviewing

The change can be roughly divided in these fairly isolated parts

  • platform specific changes to the barrier
  • refinement and refinement control thread changes; this is best reviewed starting from the G1ConcurrentRefineThread::run_service method
  • changes to garbage collection: merge_refinement_table() in g1RemSet.cpp
  • policy modifications are typically related to code around the calls to G1Policy::record_dirtying_stats.

Further information is available in the JEP draft; there is also an a bit more extensive discussion of the change on my blog.

Some additional comments:

  • the pre-marking of young generation cards has been removed. Benchmarks did not show any significant difference either way. To me this makes somewhat sense because the entire young gen will quickly get marked anyway. I.e. one only saves a single additional card table write (for every card). With the old barrier the costs for a card table mark has been much higher.
  • G1 sets UseCondCardMark to true by default. The conditional card mark corresponds to the third filter in the write barrier now, and since I decided to keep all filters for this change, it makes sense to directly use this mechanism.

If there are any questions, feel free to ask.

Testing: tier1-7 (multiple tier1-7, tier1-8 with slightly older versions)

Thanks, Thomas


Progress

  • [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [ ] Change requires CSR request JDK-8352139 to be approved
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issues

  • JDK-8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier (Enhancement - P4)
  • JDK-8352139: Remove unused G1UpdateBufferSize global variable (CSR)

Reviewers

Contributors

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739
$ git checkout pull/23739

Update a local copy of the PR:
$ git checkout pull/23739
$ git pull https://git.openjdk.org/jdk.git pull/23739/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 23739

View PR using the GUI difftool:
$ git pr show -t 23739

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23739.diff

Using Webrev

Link to Webrev Comment

tschatzl avatar Feb 23 '25 18:02 tschatzl

:wave: Welcome back tschatzl! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Feb 23 '25 18:02 bridgekeeper[bot]

/contributor add @tstuefe /contributor add @offamitkumar

tschatzl avatar Feb 23 '25 18:02 tschatzl

@tschatzl This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8342382: Implement JEP 522: G1 GC: Improve Throughput by Reducing Synchronization

Co-authored-by: Amit Kumar <[email protected]>
Co-authored-by: Martin Doerr <[email protected]>
Co-authored-by: Carlo Refice <[email protected]>
Co-authored-by: Fei Yang <[email protected]>
Reviewed-by: iwalulya, rcastanedalo, aph, ayang

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 7 new commits pushed to the master branch:

  • 0ba4141cb12414c08be88b37ea2a163aacbfa7de: 8366878: Improve flags of compiler/loopopts/superword/TestAlignVectorFuzzer.java
  • e8db14f584fa92db170e056bc68074ccabae82c9: 8349910: Implement JEP 517: HTTP/3 for the HTTP Client API
  • 433d2ec534bbf6ec08157c976b567b81b748b128: 8367409: G1: Remove unused G1MonotonicArena::Segment::copy_to()
  • ... and 4 more: https://git.openjdk.org/jdk/compare/5efaa9970ace463f7d9bcd8f4028b1d60665cfad...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk[bot] avatar Feb 23 '25 18:02 openjdk[bot]

@tschatzl Contributor Thomas Stuefe <[email protected]> successfully added.

openjdk[bot] avatar Feb 23 '25 18:02 openjdk[bot]

@tschatzl Contributor Amit Kumar <[email protected]> successfully added.

openjdk[bot] avatar Feb 23 '25 18:02 openjdk[bot]

@tschatzl The following labels will be automatically applied to this pull request:

  • core-libs
  • graal
  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Feb 23 '25 18:02 openjdk[bot]

/contributor add @c-refice

tschatzl avatar Feb 23 '25 19:02 tschatzl

@tschatzl @c-refice was not found in the census.

Syntax: /contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]. For example:

  • /contributor add @openjdk-bot
  • /contributor add duke
  • /contributor add J. Duke <[email protected]>

User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.

openjdk[bot] avatar Feb 23 '25 19:02 openjdk[bot]

in this pr you've wrote

if (region(@x.a) != region(y)) goto done; // same region check

but on https://tschatzl.github.io/2025/02/21/new-write-barriers.html you wrote:

(1)  if (region(x.a) == region(y)) goto done;    // Ignore references within the same region/area

i guess the second one is correct

tarsa avatar Feb 23 '25 19:02 tarsa

@tschatzl I did not contribute the ppc port. Did you mean @TheRealMDoerr or @reinrich ?

tstuefe avatar Feb 24 '25 06:02 tstuefe

/contributor remove @tstuefe /contributor add @TheRealMDoerr

tschatzl avatar Feb 24 '25 08:02 tschatzl

@tschatzl Contributor Thomas Stuefe <[email protected]> successfully removed.

openjdk[bot] avatar Feb 24 '25 08:02 openjdk[bot]

@tschatzl Contributor Martin Doerr <[email protected]> successfully added.

openjdk[bot] avatar Feb 24 '25 08:02 openjdk[bot]

/contributor add Carlo Refice [email protected]

tschatzl avatar Feb 24 '25 08:02 tschatzl

@tschatzl Contributor Carlo Refice <[email protected]> successfully added.

openjdk[bot] avatar Feb 24 '25 08:02 openjdk[bot]

/label add hotspot-gc

tschatzl avatar Feb 24 '25 10:02 tschatzl

@tschatzl The hotspot-gc label was successfully added.

openjdk[bot] avatar Feb 24 '25 10:02 openjdk[bot]

Webrevs

mlbridge[bot] avatar Feb 25 '25 15:02 mlbridge[bot]

I don't see any failure on s390x. Tier1 test looks good.

offamitkumar avatar Mar 03 '25 14:03 offamitkumar

I got an error while testing java/foreign/TestUpcallStress.java on linuxaarch64 with this PR:

#  Internal Error (/openjdk-jdk-linux_aarch64-dbg/jdk/src/hotspot/share/gc/g1/g1CardTable.cpp:56), pid=19044, tid=19159
#  guarantee(!failures) failed: there should not have been any failures
...
V  [libjvm.so+0xb6e988]  G1CardTable::verify_region(MemRegion, unsigned char, bool)+0x3b8  (g1CardTable.cpp:56)
V  [libjvm.so+0xc3a10c]  G1MergeHeapRootsTask::G1ClearBitmapClosure::do_heap_region(G1HeapRegion*)+0x13c  (g1RemSet.cpp:1048)
V  [libjvm.so+0xb7a80c]  G1CollectedHeap::par_iterate_regions_array(G1HeapRegionClosure*, G1HeapRegionClaimer*, unsigned int const*, unsigned long, unsigned int) const+0x9c  (g1CollectedHeap.cpp:2059)
V  [libjvm.so+0xc49fe8]  G1MergeHeapRootsTask::work(unsigned int)+0x708  (g1RemSet.cpp:1225)
V  [libjvm.so+0x19597bc]  WorkerThread::run()+0x98  (workerThread.cpp:69)
V  [libjvm.so+0x1824510]  Thread::call_run()+0xac  (thread.cpp:231)
V  [libjvm.so+0x13b3994]  thread_native_entry(Thread*)+0x130  (os_linux.cpp:877)
C  [libpthread.so.0+0x875c]  start_thread+0x18c

TheRealMDoerr avatar Mar 04 '25 10:03 TheRealMDoerr

I got an error while testing java/foreign/TestUpcallStress.java on linuxaarch64 with this PR:

#  Internal Error (/openjdk-jdk-linux_aarch64-dbg/jdk/src/hotspot/share/gc/g1/g1CardTable.cpp:56), pid=19044, tid=19159
#  guarantee(!failures) failed: there should not have been any failures
...
V  [libjvm.so+0xb6e988]  G1CardTable::verify_region(MemRegion, unsigned char, bool)+0x3b8  (g1CardTable.cpp:56)
V  [libjvm.so+0xc3a10c]  G1MergeHeapRootsTask::G1ClearBitmapClosure::do_heap_region(G1HeapRegion*)+0x13c  (g1RemSet.cpp:1048)
V  [libjvm.so+0xb7a80c]  G1CollectedHeap::par_iterate_regions_array(G1HeapRegionClosure*, G1HeapRegionClaimer*, unsigned int const*, unsigned long, unsigned int) const+0x9c  (g1CollectedHeap.cpp:2059)
V  [libjvm.so+0xc49fe8]  G1MergeHeapRootsTask::work(unsigned int)+0x708  (g1RemSet.cpp:1225)
V  [libjvm.so+0x19597bc]  WorkerThread::run()+0x98  (workerThread.cpp:69)
V  [libjvm.so+0x1824510]  Thread::call_run()+0xac  (thread.cpp:231)
V  [libjvm.so+0x13b3994]  thread_native_entry(Thread*)+0x130  (os_linux.cpp:877)
C  [libpthread.so.0+0x875c]  start_thread+0x18c

I will try to reproduce. Thanks.

tschatzl avatar Mar 04 '25 10:03 tschatzl

I got an error while testing java/foreign/TestUpcallStress.java on linuxaarch64 with this PR:

Fixed.

tschatzl avatar Mar 08 '25 19:03 tschatzl

Tier1-3 test good on linux-riscv64 platform. And I have prepared an add-on change which implements the barrier method to write cards for a reference array for this platform. Do you want to have it in this PR? Thanks. 23739-riscv-addon.txt

RealFYang avatar Mar 11 '25 03:03 RealFYang

/contributor add @RealFYang

tschatzl avatar Mar 11 '25 09:03 tschatzl

@tschatzl Contributor Fei Yang <[email protected]> successfully added.

openjdk[bot] avatar Mar 11 '25 09:03 openjdk[bot]

Tier1-3 test good on linux-riscv64 platform. And I have prepared an add-on change which implements the barrier method to write cards for a reference array for this platform. Do you want to have it in this PR? Thanks.

I added your changes, thank you!

tschatzl avatar Mar 11 '25 09:03 tschatzl

Commit https://github.com/openjdk/jdk/pull/23739/commits/786111735c306583af5bc75f7653f0da67d52adb fixes an issue with full gc interrupting refinement while the global card table and the JavaThread's card table changes.

Testing: tier1-7 with changes, tier1-5 with changes stressing refinement similar to the ones added to the new test.

The new variant of ArrayJuggle2 fails >50% of all times in our CI without the patch (verified 700 or so executions of that not failing with patch).

tschatzl avatar Mar 13 '25 14:03 tschatzl

Commit https://github.com/openjdk/jdk/pull/23739/commits/f419556e9177ecf9fbf22e606dd6c1b850f4330f fixes the failing compiler tests that check whether the compiler emits the correct object graph. Occurs after merging with mainline that significantly reduces total barrier cost calculation.

tschatzl avatar Mar 19 '25 13:03 tschatzl

Commit https://github.com/openjdk/jdk/pull/23739/commits/5e76a516c848e75f56e966a1ffe4115b1dce786c implements the change to make young gen length revising independent of the refinement control thread.

Infrastructure to determine currently available number of bytes for allocation and determining the next time the particular task should be redone is shared. It may be distributed across a bit more methods than I would prefer, but particularly the refinement control thread wants to reuse and keep some intermediate results (to not be required to get the Heap_lock again basically).

I did not have a good reason to make the heuristic to determine the time to the next action different for both, so they are basically the same.

There is some pre-existing problem that the minimum time for re-doing the work is ~50ms. That might be too short in some cases, but then again, if you have that short of a GC interval it may not be very useful to e.g. revise young gen length anyway.

I think with this change all current concerns are addressed.

tschatzl avatar Mar 20 '25 09:03 tschatzl

This PR needs an update for x86 platforms when merging: g1BarrierSetAssembler_x86.cpp:117:6: error: 'class MacroAssembler' has no member named 'get_thread'

TheRealMDoerr avatar Apr 09 '25 22:04 TheRealMDoerr