Segmentation fault with ConcurrentImmix and h2o

Open wks opened this issue 2 months ago • 3 comments

On the current master branch (https://github.com/mmtk/mmtk-openjdk/commit/e350a01316b56172477906b62e07a2373232fe31), running the h2o benchmark from DaCapo Chopin repeatedly with ConcurrentImmix is very likely to crash with a segmentation fault.

The command line I am using is:

while MMTK_PLAN=ConcurrentImmix /home/wks/projects/mmtk-github/openjdk/build/linux-x86_64-normal-server-fastdebug/jdk/bin/java -XX:MetaspaceSize=500M \
  -XX:+DisableExplicitGC \
  -server \
  -XX:+CrashOnOutOfMemoryError \
  -XX:+UseThirdPartyHeap \
  -Xms340M -Xmx340M \
  -XX:+UnlockDiagnosticVMOptions -XX:CompilerDirectivesFile=compiler-directives/noc2.json \
  -jar dacapo-23.11-MR2-chopin.jar \
  -n 1 h2o; do true; done

noc2.json is a compiler directives file that excludes every method from C2:

[
    {
        "match": ["*.*"],
        "c2": {
            "Exclude": true
        }
    }
]

The command runs the h2o benchmark repeatedly on the fastdebug build with the C2 JIT compiler disabled. It usually crashes within one or two minutes with the following error message:

[2025-10-20T08:02:55Z INFO  mmtk::plan::concurrent::immix::global] FinalMark start
[2025-10-20T08:02:55Z INFO  mmtk::plan::concurrent::immix::global] FinalMark end
[2025-10-20T08:02:55Z INFO  mmtk::scheduler::scheduler] End of GC (37342/87040 pages, took 49 ms)
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fac23771365, pid=159679, tid=159756
#
# JRE version: OpenJDK Runtime Environment (11.0.19) (fastdebug build 11.0.19-internal+0-adhoc.wks.openjdk)
# Java VM: OpenJDK 64-Bit Server VM (fastdebug 11.0.19-internal+0-adhoc.wks.openjdk, mixed mode, tiered, compressed oops, third-party gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x1371365]  Klass::method_at_vtable(int)+0x25
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %d %F" (or dumping to /home/wks/opt/dacapo/core.159679)
#
# An error report file with more information is saved as:
# /home/wks/opt/dacapo/hs_err_pid159679.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#
Current thread is 159756
Dumping core ...
Aborted                    (core dumped) MMTK_PLAN=ConcurrentImmix /home/wks/projects/mmtk-github/openjdk/build/linux-x86_64-normal-server-fastdebug/jdk/bin/java -XX:MetaspaceSize=500M -XX:+DisableExplicitGC -server -XX:+CrashOnOutOfMemoryError -XX:+UseThirdPartyHeap -Xms340M -Xmx340M -XX:+UnlockDiagnosticVMOptions -XX:CompilerDirectivesFile=compiler-directives/noc2.json -jar dacapo-23.11-MR2-chopin.jar -n 1 h2o

And here is hs_err_pid159679.log just in case it is useful.

Sometimes it instead crashes with a dump of the C1-compiled method and its bytecode:

[2025-10-20T08:10:09Z INFO  mmtk::plan::concurrent::immix::global] FinalMark start
[2025-10-20T08:10:09Z INFO  mmtk::plan::concurrent::immix::global] FinalMark end
[2025-10-20T08:10:09Z INFO  mmtk::scheduler::scheduler] End of GC (43722/87040 pages, took 117 ms)
implicit exception happened at 0x00007f4b24eb8ac2
Compiled method (c1)   47092 5036       1       water.Value::get (46 bytes)
 total in heap  [0x00007f4b24eb8810,0x00007f4b24eb8ee8] = 1752
 relocation     [0x00007f4b24eb8998,0x00007f4b24eb8a30] = 152
 main code      [0x00007f4b24eb8a40,0x00007f4b24eb8c60] = 544
 stub code      [0x00007f4b24eb8c60,0x00007f4b24eb8ce8] = 136
 oops           [0x00007f4b24eb8ce8,0x00007f4b24eb8cf0] = 8
 metadata       [0x00007f4b24eb8cf0,0x00007f4b24eb8d08] = 24
 scopes data    [0x00007f4b24eb8d08,0x00007f4b24eb8da0] = 152
 scopes pcs     [0x00007f4b24eb8da0,0x00007f4b24eb8ee0] = 320
 dependencies   [0x00007f4b24eb8ee0,0x00007f4b24eb8ee8] = 8
0 fast_aload_0
1 invokespecial 141 <water/Value.touch()V> 
  0   bci: 1    CounterData         count(8095)
4 fast_aaccess_0
5 fast_agetfield 75 <water/Value._pojo/Lwater/Freezable;> 
8 checkcast 4 <water/Iced>
  16  bci: 8    ReceiverTypeData    flags(1) count(0) nonprofiled_count(0) entries(2)
                                    'water/Job'(344 0.90)
                                    'water/fvec/NFSFileVec'(38 0.10)
11 astore_1
12 aload_1
13 ifnull 18
  72  bci: 13   BranchData          taken(38) displacement(32)
                                    not taken(8122)
...

I am still not sure whether the bug is in mmtk-core or the binding because we currently don't have another VM binding that supports ConcurrentImmix.

We can also observe this crash after refactoring the barrier implementations in mmtk-openjdk (https://github.com/mmtk/mmtk-openjdk/pull/332), so the crash should be caused by something that PR didn't change.

wks avatar Oct 20 '25 08:10 wks

Here are some more clues.

The PR https://github.com/mmtk/mmtk-core/pull/1400 accidentally made every ConcurrentImmix GC a full GC, which hid the bug so that it no longer crashed. After the fix (https://github.com/mmtk/mmtk-core/pull/1404), the crash can be reproduced again. This means the bug is related to concurrent GC and is not triggered by full GCs in ConcurrentImmix.

The bug is still reproducible with the MMTK_THREADS=1 environment variable.

The bug is still reproducible with MMTK_NO_REFERENCE_TYPES=true. In this case, all Reference objects are considered strong, and therefore all softly, weakly, or phantom reachable objects are part of the SATB snapshot. The load_weak_reference barrier is therefore unnecessary. Since it still crashes, the bug may be related to the write barrier or other parts.

wks avatar Oct 20 '25 08:10 wks

Now I am certain that the concurrent GC is reclaiming live objects. During Block::sweep, the following patch uses VO bits to find the objects in ImmixSpace that are not marked and zeroes their bodies. It is only usable in VMs where the object start is always equal to the raw address of ObjectReference.

diff --git a/src/util/metadata/vo_bit/helper.rs b/src/util/metadata/vo_bit/helper.rs
index 4dd06459de..3aaa33e515 100644
--- a/src/util/metadata/vo_bit/helper.rs
+++ b/src/util/metadata/vo_bit/helper.rs
@@ -191,6 +191,24 @@ pub(crate) fn on_region_swept<VM: VMBinding, R: Region>(region: &R, is_occupied:
         VOBitUpdateStrategy::CopyFromMarkBits => {
             // In this strategy, we need to update the VO bits state after marking.
             if is_occupied {
+                let mut addr = region.start();
+                let mut in_dead = false;
+                while addr < region.end() {
+                    if let Some(obj) = vo_bit::is_vo_bit_set_for_addr(addr) {
+                        if VM::VMObjectModel::LOCAL_MARK_BIT_SPEC
+                            .is_marked::<VM>(obj, Ordering::Relaxed)
+                        {
+                            in_dead = false;
+                        } else {
+                            in_dead = true;
+                        }
+                    }
+                    if in_dead {
+                        unsafe { std::ptr::write::<u64>(addr.as_mut_ref::<u64>(), 0) };
+                    }
+                    addr += 8usize;
+                }
+
                 // If the block has live objects, copy the VO bits from mark bits.
                 vo_bit::bcopy_vo_bit_from_mark_bit::<VM>(region.start(), R::BYTES);
             } else {

After applying this patch, Immix will run normally, but ConcurrentImmix will crash after only three or four GCs, even with the C2 JIT compiler enabled. Now it usually crashes because the oop.klass field is 0.
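
For readers who do not want to read the diff, here is a minimal sketch of the walk the patch performs. It is a toy model only: plain Java arrays stand in for heap words, VO bits and mark bits, and the names `PoisonSweepSketch` and `poisonDead` are made up for illustration, not MMTk APIs. As in the patch, a word is zeroed only while the most recent object start seen was unmarked.

```java
import java.util.Arrays;

// Toy model only: arrays stand in for heap words, VO bits and mark bits.
// None of these names are MMTk APIs.
class PoisonSweepSketch {
    /** Zeroes every word that follows an object start (VO bit) with no mark bit. */
    static void poisonDead(long[] words, boolean[] voBit, boolean[] markBit) {
        boolean inDead = false;
        for (int i = 0; i < words.length; i++) {
            if (voBit[i]) {
                // A new object starts here; it is dead if it was not marked.
                inDead = !markBit[i];
            }
            if (inDead) {
                words[i] = 0L;
            }
        }
    }

    public static void main(String[] args) {
        // Three objects: words 0-1 (marked), 2-4 (unmarked), 5-7 (marked).
        long[] words = {11, 12, 21, 22, 23, 31, 32, 33};
        boolean[] vo = {true, false, true, false, false, true, false, false};
        boolean[] mark = {true, false, false, false, false, true, false, false};
        poisonDead(words, vo, mark);
        System.out.println(Arrays.toString(words)); // [11, 12, 0, 0, 0, 31, 32, 33]
    }
}
```

If the collector wrongly leaves a live object unmarked, its header words are zeroed as well, which would be consistent with the mutator later seeing a zero oop.klass field.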

wks avatar Oct 21 '25 08:10 wks

Due to a bug which will be fixed by https://github.com/mmtk/mmtk-core/pull/1412, no object in the LOS has its unlog bit set during concurrent marking, so the SATB barrier is not applied to any object in the LOS.

The following program can reproduce the bug in the master branch. If the VO bit is enabled (make OpenJDK with MMTK_VO_BIT=1), it will panic with "VO bit not set".

public class MakeArrays {
    // Large enough to get it allocated into the LOS.
    static int ARRAY_LEN = 10000;
    static int ARRAY_OF_ARRAYS_LEN = 1000;

    public static void main(String[] args) {
        int rounds = 1000000;

        Object[][] arrayOfArrays = new Object[ARRAY_OF_ARRAYS_LEN][];
        int cursor = 0;

        System.out.format("Do %d rounds...%n", rounds);

        for (int i = 0; i < rounds; i++) {
            Object[] ary = new Object[ARRAY_LEN];
            ary[0] = Integer.toString(i);
            if (arrayOfArrays[cursor] != null) {
                Object old = arrayOfArrays[cursor][0];
                if (!old.equals(Integer.toString(cursor))) {
                    throw new RuntimeException(String.format("Should be %d, but got %s", cursor, old));
                }
                ary[0] = old;
                // This detaches the `old` object from a large array.
                arrayOfArrays[cursor][0] = null;
            }
            arrayOfArrays[cursor] = ary;
            cursor = (cursor + 1) % ARRAY_OF_ARRAYS_LEN;
        }

        System.out.format("Done.%n");
    }
}
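
The detach step in the reproducer above can be modelled with a small simulation. This is a sketch under stated assumptions, not MMTk or OpenJDK code: `Obj`, `write` and `leafSurvives` are hypothetical names, the `unlogged` field stands in for the unlog bit, and the model assumes that objects allocated during concurrent marking are treated as live without being scanned. It shows how a deletion from an object without the unlog bit can leave a reachable object unmarked.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: hypothetical classes, not MMTk or OpenJDK APIs.
class SatbUnlogSketch {
    static final class Obj {
        final Obj[] fields;
        boolean unlogged; // stand-in for the per-object unlog bit
        boolean marked;
        Obj(int numFields) { fields = new Obj[numFields]; }
    }

    /** SATB-style write: on the first write to an unlogged object, remember its old referents. */
    static void write(Obj src, int slot, Obj value, Deque<Obj> markQueue) {
        if (src.unlogged) {
            for (Obj old : src.fields) {
                if (old != null) markQueue.add(old);
            }
            src.unlogged = false;
        }
        src.fields[slot] = value;
    }

    /** Returns whether the leaf is marked at the end of a simulated concurrent mark. */
    static boolean leafSurvives(boolean largeArrayHasUnlogBit) {
        Deque<Obj> markQueue = new ArrayDeque<>();
        Obj leaf = new Obj(0);      // plays the role of `old`
        Obj oldArray = new Obj(1);  // plays the role of arrayOfArrays[cursor]
        oldArray.fields[0] = leaf;
        Obj root = new Obj(1);      // plays the role of arrayOfArrays
        root.fields[0] = oldArray;

        // Concurrent marking starts: the root is queued, nothing is scanned yet.
        root.unlogged = true;
        oldArray.unlogged = largeArrayHasUnlogBit;
        markQueue.add(root);

        // The mutator runs before the marker reaches oldArray.
        Obj newArray = new Obj(1);
        newArray.marked = true;               // assumption: allocated live, never scanned
        newArray.fields[0] = leaf;            // ary[0] = old
        write(oldArray, 0, null, markQueue);  // arrayOfArrays[cursor][0] = null
        write(root, 0, newArray, markQueue);  // arrayOfArrays[cursor] = ary

        // The marker drains its queue.
        while (!markQueue.isEmpty()) {
            Obj o = markQueue.poll();
            if (o.marked) continue;
            o.marked = true;
            for (Obj f : o.fields) {
                if (f != null) markQueue.add(f);
            }
        }
        return leaf.marked;
    }

    public static void main(String[] args) {
        System.out.println("with unlog bit:    " + leafSurvives(true));
        System.out.println("without unlog bit: " + leafSurvives(false));
    }
}
```

In this model the leaf is marked only when the large array has its unlog bit set. Without it, the barrier takes the fast path, the deleted reference is never recorded, and the leaf stays unmarked even though the new array still points to it.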

wks avatar Nov 04 '25 13:11 wks