truffleruby
Segfault running Protobuf
I tried to run the google-protobuf test suite on TruffleRuby and encountered a segfault. I haven't been able to track down the cause of the segfault, but I have stripped it down to a small reproduction.
The issue appears to happen with one of the "well-known" data types and type coercion. You can recreate it with the latest google-protobuf release, so there's no need to invest the time in setting up a build environment for the gem.
To save some time, I've already generated the source file from the Protobuf schema:
Compiled Protobuf file: time_message_pb.rb
# frozen_string_literal: true
# Generated by the protocol buffer compiler. DO NOT EDIT!
# source: time_message.proto
require 'google/protobuf'
require 'google/protobuf/duration_pb'
descriptor_data = "\n\x12time_message.proto\x12\x05\x63rash\x1a\x1egoogle/protobuf/duration.proto\":\n\x0bTimeMessage\x12+\n\x08\x64uration\x18\x01 \x01(\x0b\x32\x19.google.protobuf.Durationb\x06proto3"
pool = Google::Protobuf::DescriptorPool.generated_pool
pool.add_serialized_file(descriptor_data)
module Crash
  TimeMessage = ::Google::Protobuf::DescriptorPool.generated_pool.lookup("crash.TimeMessage").msgclass
end
If you'd like to compile it with protoc yourself, I'm also including the Protobuf source file:
Protobuf source file: time_message.proto
syntax = "proto3";
package crash;
import "google/protobuf/duration.proto";
message TimeMessage {
  google.protobuf.Duration duration = 1;
}
Then, compile the file with protoc:
protoc --ruby_out=. time_message.proto
To induce the error, set the duration field on an instance of Crash::TimeMessage. It segfaults whether you pass the duration kwarg or create the object without any args and set the field afterwards.
jt ruby -e 'require_relative "time_message_pb"; p Crash::TimeMessage.new(duration: 10.5)'
That yields:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x000072741acab18d, pid=2045954, tid=2045954
#
# JRE version: OpenJDK Runtime Environment GraalVM CE 25-dev+20.1 (25.0+20) (build 25+20-jvmci-b01)
# Java VM: OpenJDK 64-Bit Server VM GraalVM CE 25-dev+20.1 (25+20-jvmci-b01, mixed mode, sharing, tiered, jvmci, jvmci compiler, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0x105618d] Unsafe_GetDouble+0xad
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /home/nirvdrum/dev/workspaces/truffleruby-ws/core.2045954)
#
# An error report file with more information is saved as:
# /home/nirvdrum/dev/workspaces/truffleruby-ws/hs_err_pid2045954.log
[3.531s][warning][os] Loading hsdis library failed
#
# If you would like to submit a bug report, please visit:
# https://github.com/oracle/graal/issues
#
The segfault occurs with both the JVM and the Native builds. I've attached one of the hs_err logs.
Judging from the hs_err log, it looks like it's trying to read a double from a null pointer:
siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 0x0000000000000000
If you have a gdb stack trace, that could help give an idea of what the native code is doing.
You could also try to get a guest stacktrace, e.g. by adding a check for null in com.oracle.truffle.llvm.nativemode.runtime.memory.LLVMNativeMemory#getDouble.
Actually, there is already an assert checkPointer(ptr); there, so just running with assertions enabled (on JVM) should be enough to trigger an AssertionError, and that should show the guest stacktrace.
The existing assertions don't catch this case. Here, the invalid pointer has an address like 0xbad000000040160, which satisfies the assertion:
private static boolean checkPointer(long ptr) {
    assert ptr > 0x100000 : "trying to access invalid address: " + ptr + " 0x" + Long.toHexString(ptr);
    return true;
}
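To make it concrete why the assertion doesn't fire, here is a minimal Ruby sketch that mirrors only the comparison inside checkPointer. The method name passes_check_pointer? and the constant MIN_VALID_ADDRESS are made up for illustration and are not part of TruffleRuby or Sulong; the address is the one quoted above.

```ruby
# Hypothetical re-creation of the comparison in checkPointer (ptr > 0x100000).
# This is not the real Java method, only the same arithmetic.
MIN_VALID_ADDRESS = 0x100000

def passes_check_pointer?(ptr)
  ptr > MIN_VALID_ADDRESS
end

bad_ptr = 0xbad000000040160 # the invalid address seen in this crash

puts passes_check_pointer?(bad_ptr) # the bad address is accepted by the check
puts passes_check_pointer?(0)       # a plain null pointer would be rejected
```

The check only guards against small addresses such as null, so a large garbage value like this one passes straight through to the native read.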
Unfortunately, due to an issue preventing the removal of the default Native Image segfault handler, I've been unable to capture a core dump either. Using backtrace and backtrace_symbols, I see the problematic trace as:
0 protobuf_c.bundle 0x0000000124539c5c Message_GetUpbMessage + 116
1 protobuf_c.bundle 0x00000001245337a4 Convert_RubyToUpb + 372
2 protobuf_c.bundle 0x000000012453a378 Message_setfield + 296
3 protobuf_c.bundle 0x000000012453a724 Message_method_missing + 692
4 libtrufflerubytrampoline.dylib 0x0000000104f8d6fc rb_tr_setjmp_wrapper_int_pointer2_to_pointer + 136
5 libtrufflenfi.dylib 0x0000000104a7c050 ffi_call_SYSV + 80
6 libtrufflenfi.dylib 0x0000000104a7b33c ffi_call_int + 1512
7 libtrufflenfi.dylib 0x0000000104a7a5dc executeHelper + 1140
8 libtrufflenfi.dylib 0x0000000104a7a0f0 Java_com_oracle_truffle_nfi_backend_libffi_LibFFIContext_executeNative + 140
9 ??? 0x000000011918df88 0x0 + 4716027784
I did some printf debugging and found that the VALUE argument to the method_missing implementation (Message_method_missing) has a bad handle.