
Refactor barrier implementations

Open wks opened this issue 3 months ago • 20 comments

We recently added another write barrier, the SATBBarrier. Since it is another barrier based on the per-object unlog bit, it shares much code with the ObjectBarrier. We refactor the write barriers here to improve code sharing and clean up unused code.

We introduce the abstract base classes MMTkUnlogBitBarrierSetRuntime, MMTkUnlogBitBarrierSetAssembler, MMTkUnlogBitBarrierSetC1 and MMTkUnlogBitBarrierSetC2, from which both ObjectBarrier and SATBBarrier derive. These base classes contain the parts common to unlog-bit handling: the fast paths for checking whether the unlog bit is set, and the generic implementations of both the pre and the post barriers where applicable.

Runtime:

  • Extracted a method for testing if the unlog bit is set for a given object.

Assembler:

  • Extracted a method for emitting the unlog bit checking fast path.
  • Extracted a method for generating both the pre and the post write barriers.
  • Simplified the fast path to use fewer instructions and fewer scratch registers.
  • Worked around an issue where the tmp1 and tmp2 temporary registers passed in from BarrierSetAssembler::store_at may overlap with dst.base() and dst_index(), respectively.
    • See https://bugs.openjdk.org/browse/JDK-8301371

C1:

  • Changed "runtime stubs" to be one per runtime function, not one per pre/post/refload barrier. Those "runtime stubs" are now assembled by MMTkBarrierSetAssembler instead of by MMTkObjectBarrierSetAssembler or MMTkSATBBarrierSetAssembler. The intention is that all barriers can call those functions.
  • Extracted a method for emitting the unlog bit checking fast path.
  • Extracted a method for emitting both the pre and the post write barriers and their slow-path code stub.
  • We no longer pass the slot and new_val to the runtime function in the slow path. Object-logging barriers only need the obj argument for scanning its fields.
    • For this reason, we moved the C1 CodeStub for the pre/post barrier slow paths down from mmtkBarrierSetC1.hpp to mmtkUnlogBitBarrier.hpp (and unified the pre/post stub implementation) because it no longer needs the slot or new_val args, and is now specific to unlog bit barriers. If we introduce field-logging barriers, we will implement a different stub.
  • The C1 SATBBarrier no longer needs code patching. The code patching was intended for barriers that load from the field in the fast path using LIR instructions, such as G1's SATB barrier. Our object-logging barrier scans the fields in runtime functions (mmtk_object_reference_write_pre), not using LIR instructions.

C2:

  • Extracted a method for emitting the unlog bit checking fast path.
  • Extracted a method for emitting both the pre and the post write barriers.
  • We no longer pass the slot and new_val to the runtime function in the slow path.

Misc:

  • Removed the #define private public hack as we no longer need it.
  • Removed unused macros, dead code and irrelevant comments.
  • Changed the macro SOFT_REFERENCE_LOAD_BARRIER to a run-time flag mmtk_enable_reference_load_barrier settable by environment variable.

wks avatar Sep 22 '25 02:09 wks

I think we should specify in the type that this is an object remembering barrier. LXR's field remembering barrier would also be an "unlog-bit" barrier but it's a different implementation.

k-sareen avatar Sep 24 '25 01:09 k-sareen

I think we should specify in the type that this is an object remembering barrier. LXR's field remembering barrier would also be an "unlog-bit" barrier but it's a different implementation.

I chose the name MMTkUnlogBitBarrierSet because the mmtk-core simply calls the metadata "unlog bit" or just "log bit". But I think it is worth emphasizing that it is "object unlog bit" or "object remembering barrier" rather than "field unlog bit" or "field remembering barrier".

But I hesitate to change the name now because "MMTkObjectUnlogBitBarrier" and "MMTkObjectRememberingBarrier" are both similar to the current "MMTkObjectBarrier", and the longer name feels like it should be a subclass rather than a superclass. It is also a bit confusing because both "ObjectBarrier" and "SATBBarrier" are object-logging. I think we should rename "ObjectBarrier" to a more meaningful name, such as "object modification barrier". We can discuss this as a group.

wks avatar Sep 24 '25 02:09 wks

https://github.com/mmtk/mmtk-openjdk/actions/runs/17941785053/job/51021271068?pr=332 One test failed due to "free(): corrupted unsorted chunks". One possibility is that malloc happened to mmap its memory within MMTk's memory range, and MMTk somehow corrupted it. We should quarantine the entire MMTk memory range to prevent this from happening. It is also possible that a heap object contained a pointer to a malloc'd object and the heap object was collected prematurely. We'll need more evidence.

wks avatar Sep 24 '25 02:09 wks

Another failure https://github.com/mmtk/mmtk-openjdk/actions/runs/17941785053/job/51021285329?pr=332 was a "duplicated edges" error in SemiSpace. It was already reported. See https://github.com/mmtk/mmtk-openjdk/issues/331

wks avatar Sep 24 '25 02:09 wks

I labelled this as ready for review.

There is one thing intentionally not done in this PR:

To be consistent with the write barriers, MMTkBarrierSet{Assembler,C1,C2} should also override load_at and provide virtual methods object_reference_read_{pre,post} for subclasses to override. However, we currently don't have read barriers in MMTk core except the SATBBarrier, which only applies read barriers to weak references. We can't work out a nice interface for read barriers until we have one in MMTk core, as it may be more complicated than the write barriers. For example,

  • A "loaded value barrier" is a post barrier, but it needs access to the loaded value and can override the value produced by the raw memory read. In other words, it needs to overwrite the result register (Assembler and C1), and/or may need to return a different register/node that holds the final result (C1 and C2).
  • An "indirection barrier" may override the slot address to read from, and applies not only to reference fields, but to value fields as well.

Because SATBBarrier is currently the only barrier (policy) that has a barrier (mechanism) for read accesses, we just let it override load_at directly. If we introduce another barrier, such as the field-logging SATB barrier or RC barrier, we may hoist some code to a shared file or superclasses.

wks avatar Sep 25 '25 06:09 wks

Can you run benchmarks with GenImmix to make sure the changes do not affect performance?

Here are the results. vole.moma, lusearch from DaCapo Chopin, 40 invocations, 5 iterations, with three heap sizes: 1.9x, 2.4x and 3.0x w.r.t. SemiSpace on vole.moma. build1 is master and build2 is this PR.

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-25-Thu-081657&build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&hfac^mmtk_gc&build;build1|40&Histogram%20(with%20CI)^build^mmtk_gc&

Most data show no difference. But some cases, such as (build1, 1.9x, Immix) and (build2, 2.4x, StickyImmix), exhibited large slowdowns together with large noise. These are caused by abnormal slowdowns in mutator time (time.other) in one or two invocations, which can be seen by plotting the time of each invocation separately.

  • (build1, 1.9x, Immix): https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-25-Thu-081657&build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4&mmtk_gc^1^Immix&hfac^1^1892|40&XY^build^invocation&time.other
  • (build2, 2.4x, StickyImmix): https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-25-Thu-081657&build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4&mmtk_gc^1^StickyImmix&hfac^1^2428|40&XY^build^invocation&time.other
  • (build2, 3.0x, Immix): https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-25-Thu-081657&build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4&mmtk_gc^1^Immix&hfac^1^3023|40&XY^build^invocation&time.other

Since it affects both the master branch and this PR, I suspect it is caused by either

  • an existing bug in the master branch that caused abnormal slowdown in mutator, or
  • some anomalies on vole.moma that interrupted the testing processes at unpredictable times.

I'll run the experiment on a different moma machine to see if it can be reproduced.

wks avatar Sep 26 '25 01:09 wks

The CPU frequency of vole.moma, the machine I ran the tests on, was not locked to maximum. This may be the cause of the abnormal slowdowns. I am running the same tests on another machine where the CPU frequencies are set to maximum.

wks avatar Sep 26 '25 03:09 wks

Can you run benchmarks with GenImmix to make sure the changes do not affect performance?

Here are the results. vole.moma, lusearch from DaCapo Chopin, 40 invocations, 5 iterations, with three heap sizes: 1.9x, 2.4x and 3.0x w.r.t. SemiSpace on vole.moma. build1 is master and build2 is this PR.

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-25-Thu-081657&build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&hfac^mmtk_gc&build;build1|40&Histogram%20(with%20CI)^build^mmtk_gc&

Can you run all the benchmarks over the weekend?

qinsoon avatar Sep 26 '25 03:09 qinsoon

I reran the lusearch benchmark with the same binaries and the same configuration on mole.moma (same machine configuration; we have checked that the CPU frequencies are almost always at maximum, although some CPUs occasionally show 50% of the maximum frequency in /proc/cpuinfo).

The results are similar. Overall, there is no significant evidence of performance improvement or regression.

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|mole-2025-09-26-Fri-054710&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&benchmark^hfac^mmtk_gc&build;build1|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|40&Histogram%20(with%20CI)^build-mmtk_gc^hfac&

And there are still random outliers in mutator time as shown below, in both build1 and build2. The two yellow spikes below are from Immix in build1 (master), and the three green spikes below are from ConcurrentImmix in build2 (this PR).

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|mole-2025-09-26-Fri-054710&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|40&XY^build-mmtk_gc^invocation&time.other

And there is no spike in STW time, as shown below. This is strange because in lusearch, STW time is 25% to 200% of the mutator time, depending on the plan and the heap size. If the slowdown is caused by CPU frequency, it should increase the STW time in the same iterations by the same ratio, too. But there is no spike in STW time.

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|mole-2025-09-26-Fri-054710&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|40&XY^build-mmtk_gc^invocation&time.stw

wks avatar Sep 26 '25 16:09 wks

Results for all benchmarks are here. DaCapo Chopin MR2, vole.moma, 2.4x min heap size w.r.t. SemiSpace on vole.moma, 40 invocations, 5 iterations, using ConcurrentImmix, Immix and GenImmix.

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-26-Fri-092914&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&benchmark^hfac^mmtk_gc&build;build1|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|40&Histogram%20(with%20CI)^build-mmtk_gc^benchmark&

time.stw is a bit noisy, but overall there is no clear sign of regression or improvement.

time.other shows no obvious changes either. There is one noisy benchmark, sunflow, which shows a 15% speed-up in mutator time with GenImmix. Plotting all invocations (as shown below), we don't see any spikes in the results, but GenImmix seems to be consistently faster on build2 (this PR) than on build1 (master).

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-26-Fri-092914&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4&benchmark^1^sunflow|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|41&XY^build-mmtk_gc^invocation&

From the un-normalized plots shown below, we see build2-GenImmix is even faster than Immix in mutator time.

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-26-Fri-092914&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|40&Histogram%20(with%20CI)^build-mmtk_gc^benchmark&

If we assume C2-compiled code is dominating the execution time, then the only change I made to the C2 barrier is removing the slot and new_val arguments in the slow-path call. But if write barrier slow paths dominate the execution time of sunflow so much that two registers may affect the performance, then build1-GenImmix should be significantly slower than Immix.

perf didn't show any significant bottleneck in build1-GenImmix.

wks avatar Sep 28 '25 03:09 wks

If slot gets removed, does this still work with field barriers? I noticed that code patching also gets removed; I presume it is because slot is removed and thus nothing needs to be patched.

tianleq avatar Sep 28 '25 23:09 tianleq

If slot gets removed, does this still work with field barriers?

No. It won't work with field barriers. That's why the methods are implemented in MMTkUnlogBitBarrierSet{Assembler,C1,C2} instead of MMTkBarrierSet{Assembler,C1,C2}. To use field barriers, we need to reimplement a new write barrier fast path anyway. If the fast path fails, it will need to call the slow path with the slot argument. We will do that in something like MMTkFieldUnlogBarrierSet{Assembler,C1,C2}.

The code blobs for calling the runtime functions, i.e. MMTkBarrierSetC1::_object_reference_write_{pre,post,slow}_c1_runtime_code_blob, still properly forward the src, slot and new_val arguments to the runtime functions as long as those arguments are provided, and should still work for field barriers.

I noticed that code patching also gets removed, I presume it is because slot is removed and thus nothing needs to be patched.

That's right. If we don't need the slot address, we don't need code patching. We will need code patching again to support field barriers.

wks avatar Sep 29 '25 02:09 wks

There are 3 tests failing in ConcurrentImmix.

qinsoon avatar Sep 29 '25 03:09 qinsoon

A side note: in C1, code patching is done at the instruction level instead of the LIR level. When code needs patching, the value of slot is incorrect. To support field barriers, we need to be conservative in this case, keeping track of all slots of the src.

tianleq avatar Oct 02 '25 01:10 tianleq

This PR may have introduced a bug in the SATB barrier. So far we have observed it crashing in avrora, h2o, lusearch, and xalan in the CI. It is hard to reproduce, but I managed to reproduce it locally in h2o. It always crashes after FinalMark, and it crashes at random locations, including both native code and Java code. It looks like some objects are prematurely reclaimed, possibly related to weak references.

wks avatar Oct 10 '25 14:10 wks

I fixed one obvious bug which is a typo that removes the barrier when the new field value is null. https://github.com/mmtk/mmtk-openjdk/pull/332/commits/940c6915a8d39661cb78112afc97ef4b80cb1305

But the SIGSEGV bug remains. It is not related to the weak-ref load barrier because it can still be reproduced after disabling reference processing and the ref-load barrier. When I disable the C1 write barrier fast path, the bug can't be reproduced, so I think it is the culprit. It is my least favorite part of the refactoring because it tries to work around several limitations of the C1 IR. I'll look into that later.

wks avatar Oct 16 '25 10:10 wks

I fixed one obvious bug which is a typo that removes the barrier when the new field value is null. 940c691

But the SIGSEGV bug remains. It is not related to the weak-ref load barrier because it can still be reproduced after disabling reference processing and the ref-load barrier. When I disable the C1 write barrier fast path, the bug can't be reproduced, so I think it is the culprit. It is my least favorite part of the refactoring because it tries to work around several limitations of the C1 IR. I'll look into that later.

I ran into the same issue when I was implementing the prototype.

I also ran into some C1 issues, and they might be related. When code needs patching, the value of the slot is invalid. The comment in the code below is incorrect: after doing those steps, the scratch register still contains an invalid slot. The only reason for doing this is to make the C1 LIR assembler happy; otherwise it might trigger an assertion failure saying something like "expecting a register but found an address".

if (stub->patch_code != lir_patch_none) {
  // Patch
  assert(stub->scratch->is_single_cpu(), "must be");
  assert(stub->scratch->is_register(), "Precondition.");
  ce->mem2reg(stub->slot, stub->scratch, T_OBJECT, stub->patch_code, stub->info, false /*wide*/, false /*unaligned*/);
  // Now stub->scratch contains the pre_val instead of the slot address
  // So the following is to load the slot address into the scratch register
  // Resolve address
  auto masm = ce->masm();
  LIR_Address* addr = stub->slot->as_address_ptr();
  Address from_addr = ce->as_Address(addr);
  __ lea(stub->scratch->as_register(), from_addr);
  // Store parameter
  ce->store_parameter(stub->scratch->as_pointer_register(), 1);
}

tianleq avatar Oct 17 '25 00:10 tianleq

I can't reproduce the bug with -XX:-UseCompressedOops. Maybe it is related to compressed pointers.

wks avatar Oct 17 '25 02:10 wks

After updating mmtk-core to revision https://github.com/mmtk/mmtk-core/commit/1ffa5b325a07d4fe99fe7e7008b295f8febea407 (PR https://github.com/mmtk/mmtk-core/pull/1400), I can't reproduce the crash. It was still reproducible before that commit. I don't understand why: OpenJDK doesn't use alloc_with_options at all.

wks avatar Oct 17 '25 10:10 wks

I managed to reproduce the crash on the master branch. It seems to be more reproducible if we disable C2.

(update: I created an issue for this: https://github.com/mmtk/mmtk-openjdk/issues/334, and it contains steps for reproducing it on the master branch)

So it is an existing bug. It seems to be more reproducible on the CI server with this PR. But locally I can reproduce the bug on master as often as with this PR.

wks avatar Oct 20 '25 08:10 wks

The bug https://github.com/mmtk/mmtk-openjdk/issues/334 has been fixed, and this PR no longer crashes when running h2o.

wks avatar Nov 06 '25 13:11 wks