
SIGSEGV calling EventListener's onCompactionBegin() callback method with rocksdbjni 7.4.5

Open dparrella opened this issue 2 years ago • 5 comments

Note: Please use Issues only for bug reports. For questions, discussions, feature requests, etc. post to dev group: https://groups.google.com/forum/#!forum/rocksdb or https://www.facebook.com/groups/rocksdb.dev

Hello RocksDB Team!

My team has found a serious issue in production where RocksDB crashed and took the JVM down with it. This started when we introduced an EventListener that we hoped would give us more targeted logging of events as they occur; in fact, we added it to help investigate #10258. The issue occurs specifically when background compaction starts and RocksDB attempts to call our EventListener's onCompactionBegin() method.

Any insights as to what we may be doing wrong or help to resolve the issue would be much appreciated!

Expected behavior

Background compaction event callbacks should not crash the JVM.

Actual behavior

JVM exits with SIGSEGV while calling onCompactionBegin() callback method on EventListener.

Partial dump from crash:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007eff76e93c51, pid=1, tid=146
#
# JRE version: OpenJDK Runtime Environment Corretto-11.0.16.9.1 (11.0.16.1+9) (build 11.0.16.1+9-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-11.0.16.9.1 (11.0.16.1+9-LTS, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x5a9c51]  AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<1097844ul, G1BarrierSet>, (AccessInternal::BarrierType)2, 1097844ul>::oop_access_barrier(void*)+0x1
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
#   https://github.com/corretto/corretto-11/issues/
#
---------------  T H R E A D  ---------------

Current thread (0x00007efe78004000):  JavaThread "Thread-3" [_thread_in_vm, id=146, stack(0x00007efe857fe000,0x00007efe861fd000)]

Stack: [0x00007efe857fe000,0x00007efe861fd000],  sp=0x00007efe861fa870,  free space=10226k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x5a9c51]  AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<1097844ul, G1BarrierSet>, (AccessInternal::BarrierType)2, 1097844ul>::oop_access_barrier(void*)+0x1
V  [libjvm.so+0x91bba0]  jni_GetMethodID+0x70
C  [librocksdbjni1005257159522923149.so+0x2799fe]  rocksdb::EventListenerJniCallback::OnCompactionBegin(rocksdb::DB*, rocksdb::CompactionJobInfo const&)+0x8e
C  [librocksdbjni1005257159522923149.so+0x3c84f4]  rocksdb::DBImpl::NotifyOnCompactionBegin(rocksdb::ColumnFamilyData*, rocksdb::Compaction*, rocksdb::Status const&, rocksdb::CompactionJobStats const&, int)+0x204
C  [librocksdbjni1005257159522923149.so+0x3cda0e]  rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0x119e
C  [librocksdbjni1005257159522923149.so+0x3ce9e7]  rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0xd7
C  [librocksdbjni1005257159522923149.so+0x3cef2e]  rocksdb::DBImpl::BGWorkCompaction(void*)+0x4e
C  [librocksdbjni1005257159522923149.so+0x6e9248]  rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x228
C  [librocksdbjni1005257159522923149.so+0x6e9412]  rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x62


siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000000

Register to memory mapping:

RAX=0x0 is NULL
RBX=0x00007eff411a0438 points into unknown readable memory: 0x693c1495ffff0006 | 06 00 ff ff 95 14 3c 69
RCX=0x0000000000000004 is an unknown value
RDX=0x0 is NULL
RSP=0x00007efe861fa870 is pointing into the stack for thread: 0x00007efe78004000
RBP=0x00007efe861fa8d0 is pointing into the stack for thread: 0x00007efe78004000
RSI=0x000000000000000b is an unknown value
RDI=0x0 is NULL
R8 =0x00000000001349ef is an unknown value
R9 =0x0000000000000028 is an unknown value
R10=0x0 is NULL
R11=0x0 is NULL
R12=0x00007efe861fa890 is pointing into the stack for thread: 0x00007efe78004000
R13=0x00007efed217a40d: <offset 0x00000000008ed40d> in /tmp/librocksdbjni1005257159522923149.so at 0x00007efed188d000
R14=0x00007eff411a48a0 points into unknown readable memory: 0x4a28c588ffff0004 | 04 00 ff ff 88 c5 28 4a
R15=0x00007efe78004000 is a thread


Registers:
RAX=0x0000000000000000, RBX=0x00007eff411a0438, RCX=0x0000000000000004, RDX=0x0000000000000000
RSP=0x00007efe861fa870, RBP=0x00007efe861fa8d0, RSI=0x000000000000000b, RDI=0x0000000000000000
R8 =0x00000000001349ef, R9 =0x0000000000000028, R10=0x0000000000000000, R11=0x0000000000000000
R12=0x00007efe861fa890, R13=0x00007efed217a40d, R14=0x00007eff411a48a0, R15=0x00007efe78004000
RIP=0x00007eff76e93c51, EFLAGS=0x0000000000010246, CSGSFS=0x002b000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e

Top of Stack: (sp=0x00007efe861fa870)
0x00007efe861fa870:   00007efe861fa8d0 00007eff77203cd8
0x00007efe861fa880:   0000000000000000 0000000000000000
0x00007efe861fa890:   00007efe001349ef 0000000000000000
0x00007efe861fa8a0:   00007efe861fa8d0 00007efe78004000

Instructions: (pc=0x00007eff76e93c51)
0x00007eff76e93b51:   03 b8 05 00 00 00 e9 2a fd ff ff 88 4a 02 b8 03
0x00007eff76e93b61:   00 00 00 e9 1d fd ff ff 88 42 02 b8 03 00 00 00
0x00007eff76e93b71:   e9 e0 fe ff ff 41 88 46 02 b8 03 00 00 00 e9 e5
0x00007eff76e93b81:   fd ff ff 41 88 46 03 b8 04 00 00 00 e9 d7 fd ff
0x00007eff76e93b91:   ff 88 42 03 b8 04 00 00 00 e9 b7 fe ff ff 41 88
0x00007eff76e93ba1:   46 04 b8 05 00 00 00 e9 bc fd ff ff 88 42 04 b8
0x00007eff76e93bb1:   05 00 00 00 e9 9c fe ff ff 41 88 46 05 b8 06 00
0x00007eff76e93bc1:   00 00 e9 a1 fd ff ff 88 42 05 b8 06 00 00 00 e9
0x00007eff76e93bd1:   81 fe ff ff 83 c8 80 49 c1 ed 38 41 88 46 07 45
0x00007eff76e93be1:   88 6e 08 b8 09 00 00 00 e9 7b fd ff ff 83 c8 80
0x00007eff76e93bf1:   49 c1 ed 38 88 42 07 44 88 6a 08 b8 09 00 00 00
0x00007eff76e93c01:   e9 50 fe ff ff 41 88 46 06 b8 07 00 00 00 e9 55
0x00007eff76e93c11:   fd ff ff 88 42 06 b8 07 00 00 00 e9 35 fe ff ff
0x00007eff76e93c21:   66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
0x00007eff76e93c31:   48 8b 07 48 89 e5 5d c3 0f 1f 80 00 00 00 00 55
0x00007eff76e93c41:   48 8b 07 48 89 e5 5d c3 0f 1f 80 00 00 00 00 55
0x00007eff76e93c51:   48 8b 07 48 89 e5 5d c3 0f 1f 80 00 00 00 00 55
0x00007eff76e93c61:   48 8b 07 48 89 e5 5d c3 0f 1f 80 00 00 00 00 55
0x00007eff76e93c71:   48 8b 07 48 89 e5 5d c3 0f 1f 80 00 00 00 00 55
0x00007eff76e93c81:   48 8b 07 48 89 e5 5d c3 0f 1f 80 00 00 00 00 55
0x00007eff76e93c91:   48 8b 07 48 89 e5 5d c3 0f 1f 80 00 00 00 00 55
0x00007eff76e93ca1:   48 8b 07 48 89 e5 5d c3 0f 1f 80 00 00 00 00 55
0x00007eff76e93cb1:   48 8b 07 48 89 e5 5d c3 0f 1f 80 00 00 00 00 55
0x00007eff76e93cc1:   48 8b 07 48 89 e5 5d c3 0f 1f 80 00 00 00 00 55
0x00007eff76e93cd1:   48 8b 37 48 89 e5 48 85 f6 74 10 48 8d 05 cd 38
0x00007eff76e93ce1:   f5 00 5d 48 8b 38 e9 14 5d 76 00 31 c0 5d c3 55
0x00007eff76e93cf1:   48 8b 37 48 89 e5 48 85 f6 74 10 48 8d 05 ad 38
0x00007eff76e93d01:   f5 00 5d 48 8b 38 e9 f4 5c 76 00 31 c0 5d c3 55
0x00007eff76e93d11:   48 89 e5 53 48 83 ec 08 48 8b 1f 48 85 db 74 08
0x00007eff76e93d21:   48 89 df e8 97 b9 1c 00 48 83 c4 08 48 89 d8 5b
0x00007eff76e93d31:   5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 55
0x00007eff76e93d41:   48 89 e5 53 48 83 ec 08 48 8b 1f 48 85 db 74 08


Stack slot to memory mapping:
stack at sp + 0 slots: 0x00007efe861fa8d0 is pointing into the stack for thread: 0x00007efe78004000
stack at sp + 1 slots: 0x00007eff77203cd8: <offset 0x0000000000919cd8> in /usr/lib/jvm/java-11-amazon-corretto.x86_64/lib/server/libjvm.so at 0x00007eff768ea000
stack at sp + 2 slots: 0x0 is NULL
stack at sp + 3 slots: 0x0 is NULL
stack at sp + 4 slots: 0x00007efe001349ef is an unknown value
stack at sp + 5 slots: 0x0 is NULL
stack at sp + 6 slots: 0x00007efe861fa8d0 is pointing into the stack for thread: 0x00007efe78004000
stack at sp + 7 slots: 0x00007efe78004000 is a thread

Steps to reproduce the behavior

We defined a simple EventListener implementation that subscribes to the following events:

  • EnabledEventCallback.ON_BACKGROUND_ERROR
  • EnabledEventCallback.ON_COMPACTION_BEGIN
  • EnabledEventCallback.ON_COMPACTION_COMPLETED
  • EnabledEventCallback.ON_ERROR_RECOVERY_BEGIN
  • EnabledEventCallback.ON_ERROR_RECOVERY_COMPLETED
  • EnabledEventCallback.ON_EXTERNAL_FILE_INGESTED
  • EnabledEventCallback.ON_TABLE_FILE_DELETED

Then in our onCompactionBegin() implementation we log (perhaps a bit too verbosely) using SLF4J:

    private void logCompactionInfo(final String message, final CompactionJobInfo compactionJobInfo) {
        log.info("{}: Status code: [{}], reason: [{}], inputFiles: [{}], outputFiles: [{}], jobId: [{}],"
                        + "baseInputLevel: [{}], columnFamilyName: [{}], compression: [{}], threadId: [{}].",
                message,
                compactionJobInfo.status().getCodeString(),
                compactionJobInfo.compactionReason(),
                compactionJobInfo.inputFiles(),
                compactionJobInfo.outputFiles(),
                compactionJobInfo.jobId(),
                compactionJobInfo.baseInputLevel(),
                new String(compactionJobInfo.columnFamilyName(), StandardCharsets.UTF_8),
                compactionJobInfo.compression(),
                compactionJobInfo.threadId());
        compactionJobInfo.tableProperties().entrySet().stream().forEach(
                entry -> log.info("RocksDB table properties for key [{}], ColumnFamilyName: [{}], CompressionName: [{}], CreationTime: [{}],"
                                + "DataSize: [{}], FilterPolicyName: [{}], FilterSize: [{}], FixedKeyLen: [{}], MergeOperatorName: [{}],"
                                + "NumDataBlocks: [{}], NumDeletions: [{}], NumRangeDeletions: [{}], ReadableProperties: [{}], "
                                + "TopLevelIndexSize: [{}], UserCollectedProperties:[{}]",
                        entry.getKey(),
                        new String(entry.getValue().getColumnFamilyName(), StandardCharsets.UTF_8),
                        entry.getValue().getCompressionName(),
                        entry.getValue().getCreationTime(),
                        entry.getValue().getDataSize(),
                        entry.getValue().getFilterPolicyName(),
                        entry.getValue().getFilterSize(),
                        entry.getValue().getFixedKeyLen(),
                        entry.getValue().getMergeOperatorName(),
                        entry.getValue().getNumDataBlocks(),
                        entry.getValue().getNumDeletions(),
                        entry.getValue().getNumRangeDeletions(),
                        entry.getValue().getReadableProperties(),
                        entry.getValue().getTopLevelIndexSize(),
                        entry.getValue().getUserCollectedProperties())
        );
    }
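
For reference, the listener class itself is shaped roughly like this; the exact class layout and the log message string are simplified here, and EnabledEventCallback is the enum nested in (and inherited from) AbstractEventListener:

    import org.rocksdb.AbstractEventListener;
    import org.rocksdb.CompactionJobInfo;
    import org.rocksdb.RocksDB;

    public class RocksDBEventListener extends AbstractEventListener {

        public RocksDBEventListener() {
            // The varargs constructor of AbstractEventListener controls which
            // native callbacks are wired up for this listener instance.
            super(EnabledEventCallback.ON_BACKGROUND_ERROR,
                    EnabledEventCallback.ON_COMPACTION_BEGIN,
                    EnabledEventCallback.ON_COMPACTION_COMPLETED,
                    EnabledEventCallback.ON_ERROR_RECOVERY_BEGIN,
                    EnabledEventCallback.ON_ERROR_RECOVERY_COMPLETED,
                    EnabledEventCallback.ON_EXTERNAL_FILE_INGESTED,
                    EnabledEventCallback.ON_TABLE_FILE_DELETED);
        }

        @Override
        public void onCompactionBegin(final RocksDB db, final CompactionJobInfo compactionJobInfo) {
            // Per the stack trace above, the native crash in the JNI callback
            // happens before this Java method is ever invoked.
            logCompactionInfo("Compaction begin", compactionJobInfo);
        }

        // logCompactionInfo(String, CompactionJobInfo) is the method shown above;
        // the other subscribed callbacks are overridden similarly.
    }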

What's strange is that we saw this for the first time in one of our production environments. The same code made its way through several stages of test environments before going to production.

Also worth noting is that the data being loaded into RocksDB is identical to the previously deployed version of our application. The RocksDB version is also the same. The only difference is that we added the EventListener.

One thing that is different with this production system is the size of the database. We store a "snapshot" of the database in a tar.gz file that is downloaded when the application bootstraps. The compressed archive is about 2.5 GB - much larger than what we have in our test systems.

The listener is added like this:

final Options options = new Options()
                .setCompactionStyle(getRocksDBCompactionStyle())
                .setCompressionType(getCompressionType())
                .setCreateIfMissing(true)
                .setIncreaseParallelism(getIncreaseParallelismThreadCount())
                .setKeepLogFileNum(keepLogFileNum)
                .setListeners(List.of(new RocksDBEventListener()))
                .setLogger(new RocksDBLogger().getLogger())
                .setMaxOpenFiles(maxOpenFiles)
                .setMaxSubcompactions(maxSubcompactions)
                .setAllowMmapReads(allowMmapReads)
                .setAllowMmapWrites(allowMmapWrites)
                .setTableFormatConfig(tableConfig)
                .optimizeForPointLookup(blockCacheSizeMB)
                .setRateLimiter(getRateLimiter());
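
For completeness, the database is then opened with these options in the usual way; dbPath here is an illustrative stand-in for the directory holding the unpacked snapshot:

    // Minimal sketch of the open call; exception handling is omitted
    // (RocksDB.open throws RocksDBException).
    RocksDB.loadLibrary();
    final RocksDB db = RocksDB.open(options, dbPath);
    // Background compactions on this instance are what invoke the listener's
    // onCompactionBegin() callback described above.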

dparrella · Sep 15 '22 19:09

I met the same problem. Besides, I found NoClassDefFoundError in the dump output. I guess the error may be related to the Spring Boot classloader?

Event: 810.052 Thread 0x00007f2307048000 Exception <a 'java/lang/NoClassDefFoundError'{0x00000007fcf45240}: org/rocksdb/FlushJobInfo> (0x00000007fcf45240) thrown at [./src/hotspot/share/classfile/systemDictionary.cpp, line 235]
Event: 810.053 Thread 0x00007f223b358000 Exception <a 'java/lang/NoClassDefFoundError'{0x00000007fcf48f40}: org/rocksdb/CompactionJobInfo> (0x00000007fcf48f40) thrown at [./src/hotspot/share/classfile/systemDictionary.cpp, line 235]

PuGuanyu · Sep 21 '22 12:09

Hey there, looks like this has been outstanding for quite a while.

But I recently hit the same issue: adding an EventListener caused the application to exit with SIGSEGV, and I can also find NoClassDefFoundError in the error report. I'm using rocksdbjni 8.6.7 in a Java Spring Boot 3.2.3 project.

Wondering if there's any ongoing work addressing this issue? Thanks!

zz-x404 · Apr 23 '24 07:04

As a workaround, I implemented the constructor of my AbstractEventListener subclass (https://github.com/facebook/rocksdb/blob/9d37408f9af15c7a1ae42f9b94d06b27d98a011a/java/src/main/java/org/rocksdb/AbstractEventListener.java#L73-L89) so that it only calls super(ON_BACKGROUND_ERROR, ON_ERROR_RECOVERY_BEGIN, ON_ERROR_RECOVERY_COMPLETED) for the callbacks I actually need; a minimal sketch follows.
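
A minimal sketch of that workaround, assuming a static import of the enum values; the class name is illustrative, not my actual listener:

    import org.rocksdb.AbstractEventListener;
    import static org.rocksdb.AbstractEventListener.EnabledEventCallback.ON_BACKGROUND_ERROR;
    import static org.rocksdb.AbstractEventListener.EnabledEventCallback.ON_ERROR_RECOVERY_BEGIN;
    import static org.rocksdb.AbstractEventListener.EnabledEventCallback.ON_ERROR_RECOVERY_COMPLETED;

    public class RestrictedEventListener extends AbstractEventListener {
        public RestrictedEventListener() {
            // Enable only these three callbacks; the compaction/flush callbacks that
            // pass CompactionJobInfo or FlushJobInfo across JNI stay disabled.
            super(ON_BACKGROUND_ERROR, ON_ERROR_RECOVERY_BEGIN, ON_ERROR_RECOVERY_COMPLETED);
        }
    }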

At least now I'm not seeing the application exit, nor am I getting the NoClassDefFoundError errors.

Hope that narrows the problem down a little, perhaps.

zz-x404 · Apr 23 '24 20:04

Hello @dparrella @PuGuanyu @zz-x404

Thank you for reporting this issue, and sorry it took us so long to pick it up. I did some internal tests and I'm not able to reproduce this issue. I even modified the tests so that compaction is triggered from the RocksDB code and not from Java. Unfortunately, no luck.

Do you think you would be able to recompile RocksDB in debug mode with DEBUG_LEVEL=2 PORTABLE=1 and run it with your code? This will provide a better stack trace. Can you please also resolve the line numbers as described here: https://github.com/facebook/rocksdb/wiki/JNI-Debugging#stack. This information, together with the exact commit SHA you used to compile, will help us a lot.

If you are not able to compile the source code yourself, can you please provide a Maven project with a JUnit test where you are able to reproduce this error?

Radek

rhubner · Jun 18 '24 14:06

Hello @dparrella @PuGuanyu @zz-x404,

I'm able to replicate your issue. It's really a problem with the class loader. It looks similar to #691. I will submit a fix soon.

Radek

rhubner · Jul 01 '24 08:07