truffleruby

Segfault running Protobuf

Open nirvdrum opened this issue 6 months ago • 2 comments

I tried to run the google-protobuf test suite on TruffleRuby and encountered a segfault. I haven't been able to track down the cause, but I have stripped it down to a small reproduction.

The issue appears to happen with one of the "well-known" data types and type coercion. You can recreate it with the latest google-protobuf release, so there's no need to invest the time in setting up a build environment for the gem.

To save some time, I've already generated the source file from the Protobuf schema:

Compiled Protobuf file: time_message_pb.rb
# frozen_string_literal: true
# Generated by the protocol buffer compiler.  DO NOT EDIT!
# source: time_message.proto

require 'google/protobuf'

require 'google/protobuf/duration_pb'


descriptor_data = "\n\x12time_message.proto\x12\x05\x63rash\x1a\x1egoogle/protobuf/duration.proto\":\n\x0bTimeMessage\x12+\n\x08\x64uration\x18\x01 \x01(\x0b\x32\x19.google.protobuf.Durationb\x06proto3"

pool = Google::Protobuf::DescriptorPool.generated_pool
pool.add_serialized_file(descriptor_data)

module Crash
  TimeMessage = ::Google::Protobuf::DescriptorPool.generated_pool.lookup("crash.TimeMessage").msgclass
end

If you'd like to compile it with protoc yourself, I'm also including the Protobuf source file:

Protobuf source file: time_message.proto
syntax = "proto3";

package crash;

import "google/protobuf/duration.proto";

message TimeMessage {
  google.protobuf.Duration duration = 1;
}

Then, compile the file with protoc:

protoc --ruby_out=. time_message.proto

To induce the error, set the duration field on an instance of Crash::TimeMessage. It segfaults whether you pass the duration kwarg to the constructor or create the object without any args and set the field afterwards.

jt ruby -e 'require_relative "time_message_pb"; p Crash::TimeMessage.new(duration: 10.5)'

That yields:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000072741acab18d, pid=2045954, tid=2045954
#
# JRE version: OpenJDK Runtime Environment GraalVM CE 25-dev+20.1 (25.0+20) (build 25+20-jvmci-b01)
# Java VM: OpenJDK 64-Bit Server VM GraalVM CE 25-dev+20.1 (25+20-jvmci-b01, mixed mode, sharing, tiered, jvmci, jvmci compiler, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x105618d]  Unsafe_GetDouble+0xad
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /home/nirvdrum/dev/workspaces/truffleruby-ws/core.2045954)
#
# An error report file with more information is saved as:
# /home/nirvdrum/dev/workspaces/truffleruby-ws/hs_err_pid2045954.log
[3.531s][warning][os] Loading hsdis library failed
#
# If you would like to submit a bug report, please visit:
#   https://github.com/oracle/graal/issues
#

The segfault occurs with both the JVM and the Native builds. I've attached one of the hs_err logs.

hs_err_pid2045954.log

nirvdrum, May 11 '25 06:05

According to the hs_err log, it looks like it's trying to read a double through a null pointer:

siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 0x0000000000000000

If you can get a gdb stacktrace, that would help give an idea of what the native code is doing.

You could also try to get a guest stacktrace, e.g. by adding a check for null in com.oracle.truffle.llvm.nativemode.runtime.memory.LLVMNativeMemory#getDouble. Actually, there is already an assert checkPointer(ptr); there, so just running with assertions enabled (on JVM) should be enough to trigger an AssertionError, which should show the guest stacktrace.

eregon, May 13 '25 13:05

The existing assertions don't catch this case. Here, the invalid pointer has an address like 0xbad000000040160, which satisfies the assertion:

private static boolean checkPointer(long ptr) {
  assert ptr > 0x100000 : "trying to access invalid address: " + ptr + " 0x" + Long.toHexString(ptr);
  return true;
}
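
To illustrate why the check passes, here is a minimal standalone sketch. It is not the actual LLVMNativeMemory code: the class name, both helper methods, and the 47-bit upper bound are assumptions made for illustration only. The upper bound reflects the user-space address range that is typical on x86-64 Linux and arm64 macOS, and would be wrong on systems configured with a larger virtual address space.

```java
// Hypothetical sketch, not part of TruffleRuby or Sulong.
final class PointerCheckSketch {
    // Lower bound taken from the existing checkPointer assertion.
    static final long MIN_VALID_ADDRESS = 0x100000L;

    // Assumption: user-space addresses fit in 47 bits. This is typical,
    // but not guaranteed on every platform or kernel configuration.
    static final long MAX_USER_ADDRESS = (1L << 47) - 1;

    // Mirrors the condition in the existing assertion: only a lower bound.
    static boolean passesExistingCheck(long ptr) {
        return ptr > MIN_VALID_ADDRESS;
    }

    // Hypothetical stricter variant that also enforces an upper bound.
    static boolean passesStricterCheck(long ptr) {
        return ptr > MIN_VALID_ADDRESS && ptr <= MAX_USER_ADDRESS;
    }

    public static void main(String[] args) {
        long badHandle = 0xbad000000040160L; // address reported above
        long nullPointer = 0L;

        System.out.println("bad handle, existing check: " + passesExistingCheck(badHandle));
        System.out.println("bad handle, stricter check: " + passesStricterCheck(badHandle));
        System.out.println("null, existing check: " + passesExistingCheck(nullPointer));
    }
}
```

With only a lower bound, the check rejects a null pointer but accepts any large positive value, which is why 0xbad000000040160 gets through. Whether an upper bound like this would be acceptable in the real check depends on the platforms it has to support.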

Unfortunately, due to an issue preventing the removal of the default Native Image segfault handler, I've been unable to capture a core dump either. Using backtrace and backtrace_symbols, I see the problematic trace as:

  0   protobuf_c.bundle                   0x0000000124539c5c Message_GetUpbMessage + 116
  1   protobuf_c.bundle                   0x00000001245337a4 Convert_RubyToUpb + 372
  2   protobuf_c.bundle                   0x000000012453a378 Message_setfield + 296
  3   protobuf_c.bundle                   0x000000012453a724 Message_method_missing + 692
  4   libtrufflerubytrampoline.dylib      0x0000000104f8d6fc rb_tr_setjmp_wrapper_int_pointer2_to_pointer + 136
  5   libtrufflenfi.dylib                 0x0000000104a7c050 ffi_call_SYSV + 80
  6   libtrufflenfi.dylib                 0x0000000104a7b33c ffi_call_int + 1512
  7   libtrufflenfi.dylib                 0x0000000104a7a5dc executeHelper + 1140
  8   libtrufflenfi.dylib                 0x0000000104a7a0f0 Java_com_oracle_truffle_nfi_backend_libffi_LibFFIContext_executeNative + 140
  9   ???                                 0x000000011918df88 0x0 + 4716027784

I did some printf debugging and the VALUE argument to the method_missing implementation (Message_method_missing) has a bad handle.

nirvdrum, May 28 '25 20:05