pyroscope-java SIGSEGV on Java 21 / aarch64

Hey folks!

We've been running Pyroscope v0.12.2 within Trino, and after a recent upgrade to JVM 21 we started getting SIGSEGV errors.

Errors look like:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000ffffa989b478, pid=1, tid=3987
#
# JRE version: OpenJDK Runtime Environment Temurin-21.0.1+12 (21.0.1+12) (build 21.0.1+12-LTS)
# Java VM: OpenJDK 64-Bit Server VM Temurin-21.0.1+12 (21.0.1+12-LTS, mixed mode, sharing, tiered, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# V  [libjvm.so+0x6ce478]  frame::sender_for_entry_frame(RegisterMap*) const+0x128
#
# Core dump will be written. Default location: /data/trino/core.1
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
#

---------------  S U M M A R Y ------------

Command Line: -Xmx104857M -javaagent:/app/libs/jmx_prometheus_javaagent-0.20.0.jar=9000:/var/lib/trino/prometheus-exporter/prometheus-exporter-config.yaml -Xlog:gc* -XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:+UseGCOverheadLimit -XX:+ExplicitGCInvokesConcurrent -XX:+ExitOnOutOfMemoryError -Djdk.attach.allowAttachSelf=true -XX:ReservedCodeCacheSize=512M -XX:PerMethodRecompilationCutoff=10000 -XX:PerBytecodeRecompilationCutoff=10000 -Djdk.nio.maxCachedBufferSize=2000000 -XX:+UnlockDiagnosticVMOptions -XX:+UseAESCTRIntrinsics -XX:+UnlockDiagnosticVMOptions -XX:GCLockerRetryAllocationCount=100 -XX:+UseTransparentHugePages -XX:G1ReservePercent=35 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/heapdumps/ -XX:ErrorFile=/heapdumps/hs_err.logpaid-20u.hprof -Dnode.id=paid-20u-kynty-1-coordinator-0-1704338288 -Dnode.environment=production -Dnode.data-dir=/data/trino -Dplugin.dir=/usr/lib/trino/plugin -Dlog.levels-file=/etc/trino/..2024_01_04_03_17_36.2507751579/log.properties -Dconfig=/etc/trino/..2024_01_04_03_17_36.2507751579/config.properties io.trino.server.TrinoServer

Host: AArch64, 32 cores, 123G, Red Hat Enterprise Linux release 9.3 (Plow)
Time: Thu Jan  4 03:43:33 2024 UTC elapsed time: 1524.811066 seconds (0d 0h 25m 24s)

---------------  T H R E A D  ---------------

Current thread (0x0000ffe36c9970e0):  JavaThread "ContinuousTaskStatusFetcher-20240104_034212_00312_yvutb.87.11.0-3205" daemon [_thread_in_Java, id=3987, stack(0x0000ffe11a65a000,0x0000ffe11a858000) (2040K)]

Stack: [0x0000ffe11a65a000,0x0000ffe11a858000],  sp=0x0000ffe11a852f50,  free space=2019k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x6ce478]  frame::sender_for_entry_frame(RegisterMap*) const+0x128
V  [libjvm.so+0x6c96e8]  vframeStreamForte::forte_next()+0x2f8
V  [libjvm.so+0x6c9d4c]  forte_fill_call_trace_given_top(JavaThread*, ASGCT_CallTrace*, int, frame)+0x25c
V  [libjvm.so+0x6ca450]  AsyncGetCallTrace+0x210
C  [libasyncProfiler-linux-arm64-86b5f622ede6435644a9c1857582e54b4d2f2e55.so+0x1eb08]  Profiler::getJavaTraceAsync(void*, ASGCT_CallFrame*, int, StackContext*) [clone .isra.675]+0x75c
C  [libasyncProfiler-linux-arm64-86b5f622ede6435644a9c1857582e54b4d2f2e55.so+0x1ee24]  Profiler::recordSample(void*, unsigned long long, int, Event*)+0x2dc
C  [libasyncProfiler-linux-arm64-86b5f622ede6435644a9c1857582e54b4d2f2e55.so+0x1feb0]  ITimer::signalHandler(int, siginfo_t*, void*)+0x4c
C  [linux-vdso.so.1+0x83c]  __kernel_rt_sigreturn+0x0

(Happy to share the whole hs_err.log if needed 👍 )

I don't know if there's a way to consistently reproduce the issue; it happens randomly once every day or so while running 100 nodes.

I'm thinking this is pyroscope-related as:

- The native frames mention async-profiler
- Disabling pyroscope fixes the issue 😅
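
For anyone triaging hs_err logs across many nodes, here is a minimal sketch of the first check above: it scans the "Native frames" lines for signs of async-profiler. The class and method names (HsErrTriage, involvesAsyncProfiler) are hypothetical and not part of pyroscope-java or async-profiler. The two substrings it looks for are simply the ones visible in the stack trace above.

```java
import java.util.List;

public class HsErrTriage {
    // Hypothetical helper: returns true if any native frame line mentions
    // the async-profiler shared library or the AsyncGetCallTrace entry point.
    static boolean involvesAsyncProfiler(List<String> nativeFrames) {
        for (String line : nativeFrames) {
            if (line.contains("libasyncProfiler") || line.contains("AsyncGetCallTrace")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Frames copied from the crash report above.
        List<String> frames = List.of(
            "V  [libjvm.so+0x6ce478]  frame::sender_for_entry_frame(RegisterMap*) const+0x128",
            "V  [libjvm.so+0x6ca450]  AsyncGetCallTrace+0x210",
            "C  [libasyncProfiler-linux-arm64-86b5f622ede6435644a9c1857582e54b4d2f2e55.so+0x1feb0]  ITimer::signalHandler(int, siginfo_t*, void*)+0x4c");
        System.out.println(involvesAsyncProfiler(frames));
    }
}
```

A match only shows that the profiler's signal handler was on the crashing stack; it does not prove the profiler caused the crash.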

It may be related to commit https://github.com/async-profiler/async-profiler/commit/d3dde7e5e7e6990b9c0e418ed8683b84fa919bac in async-profiler to support Java 21, that hasn't yet been ported to Grafana's fork?

Jan 08 '24 14:01 Pluies

It may be related to commit https://github.com/async-profiler/async-profiler/commit/d3dde7e5e7e6990b9c0e418ed8683b84fa919bac in async-profiler to support Java 21, that hasn't yet been ported to Grafana's fork?

Not only has it not been ported, it also has not been included in any stable release yet.

Jan 09 '24 02:01 korniltsev

I think we may consider releasing a SNAPSHOT version of pyroscope-java with async-profiler built from the master branch.

Jan 09 '24 03:01 korniltsev

@korniltsev async-profiler just released v3.0 that includes this bugfix 🥳

Jan 24 '24 10:01 Pluies

yep, preparing a new pyroscope-java release

Jan 24 '24 10:01 korniltsev

Hi, we are also seeing a similar issue with agent v0.14.0. The crash report header:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f71e00e0e81, pid=1, tid=8
#
# JRE version: OpenJDK Runtime Environment Temurin-21.0.2+13 (21.0.2+13) (build 21.0.2+13-LTS)
# Java VM: OpenJDK 64-Bit Server VM Temurin-21.0.2+13 (21.0.2+13-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, shenandoah gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0xe01e81] ShenandoahConcUpdateRefsClosure::do_oop(oopDesc**)+0x21
#
# Core dump will be written. Default location: /data/core.1
#
# An error report file with more information is saved as: /app/hs_err_pid1.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
#

Note that we also use the Datadog agent; however, this issue never occurred until we introduced the pyroscope agent:

java.opts: "-javaagent:/app/datadog/dd-java-agent.jar -Xms{{xms_java_trader}}G -Xmx{{xmx_java_trader}}G -XX:+UseShenandoahGC -javaagent:/app/datadog/pyroscope-agent.jar -Dio.opentelemetry.javaagent.slf4j.simpleLogger.defaultLogLevel=off -XX:+UseStringDeduplication

Any ideas would be much appreciated.

Jun 10 '24 10:06 gabrieldimech