Refactor barrier implementations
We recently added another write barrier, the SATBBarrier. Being another barrier based on the per-object unlog bit, it shares much code with the ObjectBarrier. We refactor the write barriers to improve code sharing and clean up unused code.
We introduce abstract base classes MMTkUnlogBitBarrierSetRuntime, MMTkUnlogBitBarrierSetAssembler, MMTkUnlogBitBarrierSetC1 and MMTkUnlogBitBarrierSetC2, which both ObjectBarrier and SATBBarrier derive from. Those base classes include common parts related to unlog bit handling, including the fast paths for checking if the unlog bit is set, and also the generic implementation for both the pre and the post barriers if applicable.
Runtime:
- Extracted a method for testing if the unlog bit is set for a given object.
Assembler:
- Extracted a method for emitting the unlog bit checking fast path.
- Extracted a method for generating both the pre and the post write barriers.
- Simplified the fast path to use fewer instructions and fewer scratch registers.
- Worked around an issue where the `tmp1` and `tmp2` temporary registers passed in from `BarrierSetAssembler::store_at` may overlap with `dst.base()` and `dst.index()`, respectively. See https://bugs.openjdk.org/browse/JDK-8301371
C1:
- Changed "runtime stubs" to be one per runtime function, not one per pre/post/ref-load barrier. Those "runtime stubs" are now assembled by `MMTkBarrierSetAssembler` instead of by `MMTkObjectBarrierSetAssembler` or `MMTkSATBBarrierSetAssembler`. The intention is that all barriers can call those functions.
- Extracted a method for emitting the unlog bit checking fast path.
- Extracted a method for emitting both the pre and the post write barriers and their slow-path code stub.
- We no longer pass the `slot` and `new_val` to the runtime function in the slow path. Object-logging barriers only need the `obj` argument for scanning its fields.
  - For this reason, we moved the C1 CodeStub for the pre/post barrier slow paths down from `mmtkBarrierSetC1.hpp` to `mmtkUnlogBitBarrier.hpp` (and unified the pre/post stub implementation) because it no longer needs the `slot` or `new_val` args, and is now specific to unlog bit barriers. If we introduce field-logging barriers, we will implement a different stub.
- The C1 SATBBarrier no longer needs code patching. The code patching was intended for barriers that load from the field in the fast path using LIR instructions, such as G1's SATB barrier. Our object-logging barrier scans the fields in runtime functions (`mmtk_object_reference_write_pre`), not using LIR instructions.
C2:
- Extracted a method for emitting the unlog bit checking fast path.
- Extracted a method for emitting both the pre and the post write barriers.
- We no longer pass the `slot` and `new_val` to the runtime function in the slow path.
Misc:
- Removed the `#define private public` hack as we no longer need it.
- Removed unused macros, dead code and irrelevant comments.
- Changed the macro `SOFT_REFERENCE_LOAD_BARRIER` to a run-time flag `mmtk_enable_reference_load_barrier` settable by environment variable.
I think we should specify in the type that this is an object remembering barrier. LXR's field remembering barrier would also be an "unlog-bit" barrier but it's a different implementation.
I chose the name MMTkUnlogBitBarrierSet because the mmtk-core simply calls the metadata "unlog bit" or just "log bit". But I think it is worth emphasizing that it is "object unlog bit" or "object remembering barrier" rather than "field unlog bit" or "field remembering barrier".
But I hesitate to change the name now because "MMTkObjectUnlogBitBarrier" and "MMTkObjectRememberingBarrier" are both similar to the current "MMTkObjectBarrier", and the longer name feels like it should be a subclass rather than superclass. It is also a bit confusing because both "ObjectBarrier" and "SATBBarrier" are object-logging. I think we should rename "ObjectBarrier" to a more meaningful name, such as "object modification barrier". We can discuss this in a group.
https://github.com/mmtk/mmtk-openjdk/actions/runs/17941785053/job/51021271068?pr=332
One test failed due to "free(): corrupted unsorted chunks". One possibility is that malloc happened to mmap its memory within MMTk's memory range, and MMTk somehow corrupted it. We should quarantine the entire MMTk memory range to prevent this from happening. It is also possible that a heap object contained a pointer to a malloc'd object, and the heap object was collected prematurely. We'll need more evidence.
Another failure https://github.com/mmtk/mmtk-openjdk/actions/runs/17941785053/job/51021285329?pr=332 was a "duplicated edges" error in SemiSpace. It was already reported. See https://github.com/mmtk/mmtk-openjdk/issues/331
I labelled this as ready for review.
There is one thing intentionally not done in this PR:
To be consistent with write barriers, MMTkBarrierSet{Assembler,C1,C2} should also override load_at and provide virtual methods object_reference_read_{pre_post} for subclasses to override. However, we currently don't have read barriers in MMTk core except SATBBarrier which only applies read barriers for weak references. We can't work out a nice interface for read barriers unless we have one in MMTk core, as it may be more complicated than write barriers. For example,
- "loaded value barriers" is a post barrier, but it should have access to the loaded value, and can override the value read by the raw memory read. In other words, it needs to overwrite the result register (Assembler and C1), and/or may need to return a different register/node that holds the final result (C1 and C2)
- "indirection barriers" may override the slot address to read from, and apply not only to reference fields, but value fields as well.
Because SATBBarrier is currently the only barrier (policy) that has a barrier (mechanism) for read accesses, we just let it override load_at directly. If we introduce another barrier, such as the field-logging SATB barrier or RC barrier, we may hoist some code to a shared file or superclasses.
Can you run benchmarks with `GenImmix` to make sure the changes do not affect performance?
Here are the results. vole.moma, lusearch from DaCapo Chopin, 40 invocations, 5 iterations, with three heap sizes: 1.9x, 2.4x and 3.0x w.r.t. SemiSpace on vole.moma. build1 is master and build2 is this PR.
https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-25-Thu-081657&build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&hfac^mmtk_gc&build;build1|40&Histogram%20(with%20CI)^build^mmtk_gc&
Most data show that there is no difference. But some cases, such as (build1, 1.8x, Immix) and (build2, 2.4x, StickyImmix) exhibited large slowdown together with large noise. They are caused by abnormal slowdowns in mutator time (time.other) in one or two invocations. This can be seen by plotting the time of each invocation separately.
- (build1, 1.8x, Immix): https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-25-Thu-081657&build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4&mmtk_gc^1^Immix&hfac^1^1892|40&XY^build^invocation&time.other
- (build2, 2.4x, StickyImmix): https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-25-Thu-081657&build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4&mmtk_gc^1^StickyImmix&hfac^1^2428|40&XY^build^invocation&time.other
- (build2, 3.0x, Immix): https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-25-Thu-081657&build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4&mmtk_gc^1^Immix&hfac^1^3023|40&XY^build^invocation&time.other
Since it affects both the master branch and this PR, I suspect it is caused by either
- an existing bug in the master branch that caused abnormal slowdown in mutator, or
- some anomalies on vole.moma that interrupted the testing processes at unpredictable times.
I'll run the experiment on a different moma machine to see if it can be reproduced.
The CPU frequency of vole.moma, the machine I ran tests on, was not locked to maximum. This may be the cause of the abnormal slowdown. I am running the same tests on another machine where CPU frequencies are set to maximum.
Can you run all the benchmarks over the weekend?
I reran the lusearch benchmark with the same binaries and the same configuration on mole.moma (same machine configuration, and we have checked that the CPU frequencies are almost always at maximum, although some CPUs occasionally show 50% of the max frequency in /proc/cpuinfo).
The results are similar. Overall, there is no significant evidence of performance improvement or regression.
https://squirrel.anu.edu.au/plotty/wks/noproject/#0|mole-2025-09-26-Fri-054710&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&benchmark^hfac^mmtk_gc&build;build1|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|40&Histogram%20(with%20CI)^build-mmtk_gc^hfac&
And there are still random outliers in mutator time as shown below, in both build1 and build2. The two yellow spikes below are from Immix in build1 (master), and the three green spikes below are from ConcurrentImmix in build2 (this PR).
https://squirrel.anu.edu.au/plotty/wks/noproject/#0|mole-2025-09-26-Fri-054710&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|40&XY^build-mmtk_gc^invocation&time.other
And there is no spike in STW time, as shown below. This is strange because in lusearch, STW time is 25% to 200% of the mutator time, depending on the plan and the heap size. If the slowdown is caused by CPU frequency, it should increase the STW time in the same iterations by the same ratio, too. But there is no spike in STW time.
https://squirrel.anu.edu.au/plotty/wks/noproject/#0|mole-2025-09-26-Fri-054710&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|40&XY^build-mmtk_gc^invocation&time.stw
Results for all benchmarks are here. DaCapo Chopin MR2, vole.moma, 2.4x min heap size w.r.t. SemiSpace on vole.moma, 40 invocations, 5 iterations, using ConcurrentImmix, Immix and GenImmix.
https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-26-Fri-092914&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&benchmark^hfac^mmtk_gc&build;build1|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|40&Histogram%20(with%20CI)^build-mmtk_gc^benchmark&
time.stw is a bit noisy, but overall there is no clear sign of regression or improvement.
time.other shows no obvious changes either. There is one noisy benchmark, `sunflow`, which shows a 15% speed-up in mutator time with GenImmix. Plotting all invocations (as shown below), we don't see any spikes in the results, but GenImmix seems to be consistently faster on build2 (this PR) than on build1 (master).
https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-26-Fri-092914&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4&benchmark^1^sunflow|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|41&XY^build-mmtk_gc^invocation&
From the un-normalized plots shown below, we see that build2-GenImmix is even faster than Immix in mutator time.
https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-26-Fri-092914&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|40&Histogram%20(with%20CI)^build-mmtk_gc^benchmark&
If we assume C2-compiled code dominates the execution time, then the only change I made to the C2 barrier is removing the `slot` and `new_val` arguments in the slow-path call. But if write barrier slow paths dominated the execution time of sunflow so much that saving two registers could affect performance, then build1-GenImmix should be significantly slower than Immix.
`perf` didn't show any significant bottleneck in build1-GenImmix.
If `slot` gets removed, does this still work with field barriers? I noticed that code patching also gets removed; I presume it is because `slot` is removed and thus nothing needs to be patched.
> If `slot` gets removed, does this still work with field barriers?
No. It won't work with field barriers. That's why the methods are implemented in MMTkUnlogBitBarrierSet{Assembler,C1,C2} instead of MMTkBarrierSet{Assembler,C1,C2}. To use field barriers, we need to reimplement a new write barrier fast path anyway. If the fast path fails, it will need to call the slow path with the slot argument. We will do that in something like MMTkFieldUnlogBarrierSet{Assembler,C1,C2}.
The code blobs for calling runtime functions, i.e. `MMTkBarrierSetC1::_object_reference_write_{pre,post,slow}_c1_runtime_code_blob`, still properly forward the `src`, `slot` and `new_val` arguments to the runtime functions as long as those arguments are provided, and should still work for field barriers.
> I noticed that code patching also gets removed, I presume it is because `slot` is removed and thus nothing needs to be patched.
That's right. If we don't need the slot address, we don't need code patching. We will need code patching again to support field barriers.
There are 3 tests failing in concurrent Immix.
As a side note, in C1, code patching is done on instructions instead of LIR. When code needs patching, the value of `slot` is incorrect. To support field barriers, we need to be conservative in this case, keeping track of all slots of the `src`.
This PR may have introduced a bug in the SATB barrier. So far we have observed it crashing in avrora, h2o, lusearch, and xalan in the CI. It is hard to reproduce, but I managed to reproduce it locally in h2o. It always crashes after FinalMark, and it crashes at random locations, including both native code and Java code. It looks like some objects are prematurely reclaimed, possibly related to weak references.
I fixed one obvious bug, which is a typo that removed the barrier when the new field value is null. https://github.com/mmtk/mmtk-openjdk/pull/332/commits/940c6915a8d39661cb78112afc97ef4b80cb1305
But the SIGSEGV bug remains. It is not related to the weak-ref load barrier because it can still be reproduced after disabling reference processing and the ref-load barrier. When I disable the C1 write barrier fast path, the bug can't be reproduced, so I think that is the culprit. It is my least favorite part of the refactoring because it tries to work around several limitations of the C1 IR. I'll look into that later.
I ran into the same issue when I was implementing the prototype.
I also ran into some C1 issues, and it might be related. When code needs patching, the value of the slot is invalid. The comment in the following code is incorrect: after doing those steps, the scratch register still contains an invalid slot. The only reason for doing this is to make the C1 LIR assembler happy; otherwise, it might trigger an assertion failure saying something like "expecting a register but found an address".
```cpp
if (stub->patch_code != lir_patch_none) {
  // Patch
  assert(stub->scratch->is_single_cpu(), "must be");
  assert(stub->scratch->is_register(), "Precondition.");
  ce->mem2reg(stub->slot, stub->scratch, T_OBJECT, stub->patch_code, stub->info, false /*wide*/, false /*unaligned*/);
  // Now stub->scratch contains the pre_val instead of the slot address.
  // So the following is to load the slot address into the scratch register.
  // Resolve address
  auto masm = ce->masm();
  LIR_Address* addr = stub->slot->as_address_ptr();
  Address from_addr = ce->as_Address(addr);
  __ lea(stub->scratch->as_register(), from_addr);
  // Store parameter
  ce->store_parameter(stub->scratch->as_pointer_register(), 1);
}
```
I can't reproduce the bug with `-XX:-UseCompressedOops`. Maybe it is related to compressed pointers.
After updating mmtk-core to revision https://github.com/mmtk/mmtk-core/commit/1ffa5b325a07d4fe99fe7e7008b295f8febea407 (PR https://github.com/mmtk/mmtk-core/pull/1400), I can't reproduce the crash. It was still reproducible before that commit. I don't understand it; OpenJDK doesn't use `alloc_with_options` at all.
I managed to reproduce the crash on the master branch. It seems to be more reproducible if we disable C2.
(update: I created an issue for this: https://github.com/mmtk/mmtk-openjdk/issues/334, and it contains steps for reproducing it on the master branch)
So it is an existing bug. It seems to be more reproducible on the CI server after this PR is merged. But locally I can reproduce the bug on master as often as with this PR.
The bug https://github.com/mmtk/mmtk-openjdk/issues/334 has been fixed, and this PR no longer crashes when running h2o.