Segmentation fault on shutdown during optimization
The qdrant process produces a segmentation fault when shutting down during index optimization.
It does not always happen; it is rather rare.
Current Behavior
./cargo_run_follower2.sh: line 13: 474694 Segmentation fault (core dumped) cargo run --bin qdrant -- --bootstrap 'http://localhost:6335' &> ../cluster-qdrant-test/logs/follower2.log
Logs of the node at the time:
[2023-01-05T10:34:43.896Z DEBUG qdrant::tonic] Stopping gRPC
[2023-01-05T10:34:43.896Z DEBUG hyper::server::shutdown] signal received, starting graceful shutdown
[2023-01-05T10:34:43.896Z INFO actix_server::server] SIGINT received; starting forced shutdown
[2023-01-05T10:34:43.896Z DEBUG qdrant::tonic] Stopping internal gRPC
[2023-01-05T10:34:43.896Z INFO actix_server::worker] Shutting down idle worker
[2023-01-05T10:34:43.896Z INFO actix_server::worker] Shutting down idle worker
[2023-01-05T10:34:43.896Z DEBUG hyper::server::shutdown] signal received, starting graceful shutdown
[2023-01-05T10:34:43.896Z DEBUG actix_server::accept] Paused accepting connections on 127.0.0.1:6353
[2023-01-05T10:34:43.896Z INFO actix_server::accept] Accept thread stopped
[2023-01-05T10:34:43.896Z INFO actix_server::worker] Shutting down idle worker
[2023-01-05T10:34:43.897Z INFO actix_server::worker] Shutting down idle worker
[2023-01-05T10:34:43.899Z INFO actix_server::worker] Shutting down idle worker
[2023-01-05T10:34:43.899Z DEBUG hyper::server::server::new_svc] connection error: connection error: not connected
[2023-01-05T10:34:43.900Z DEBUG hyper::server::server::new_svc] connection error: connection error: not connected
[2023-01-05T10:34:43.900Z DEBUG hyper::server::server::new_svc] connection error: connection error: not connected
[2023-01-05T10:34:43.900Z INFO actix_server::worker] Shutting down idle worker
[2023-01-05T10:34:43.902Z INFO actix_server::worker] Shutting down idle worker
[2023-01-05T10:34:44.151Z DEBUG segment::index::hnsw_index::hnsw] finish main graph
[2023-01-05T10:34:44.180Z DEBUG segment::index::hnsw_index::hnsw] finish additional payload field indexing
I was able to analyze the core dump with gdb; the full log with all stack traces is in core-dump-gdb.log
The main information is:
Core was generated by `target/debug/qdrant --bootstrap http://localhost:6335'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055d096b2e34a in rocksdb::DBImpl::MaybeScheduleFlushOrCompaction (this=0x7f3adc0d5c80) at rocksdb/db/db_impl/db_impl_compaction_flush.cc:2452
2452 env_->GetBackgroundThreads(Env::Priority::HIGH) == 0;
Steps to Reproduce
Deploy a 3-node cluster and kill one node while it is optimizing an HNSW index.
Hi 👋
Looking through the provided gdb log, it seems the error happened in the RocksDB code, not in qdrant itself. I would like to reproduce this bug. Can you please provide more information about the environment where this happened?
Hi @bazhenov :wave:
Thank you for your interest in this issue!
I have not tried to reproduce this issue recently, but I used to be able to get it from time to time.
All the gdb logs I got pointed to RocksDB; here are the ones I still have:
I ran into this issue by running the following test on my laptop (Ubuntu):
- create a 3 node cluster
- create a single collection with 3 shards and 2 replicas
- ingest a few thousand points to trigger the indexing process
- manually restart nodes one at a time while the indexing & ingestion process is running
If you get lucky, one of those nodes will crash during shutdown.
Here is what I've found so far. I was able to reproduce the issue and catch it under lldb with a debug build of qdrant. The stack trace:
* thread #51, name = 'background_flush', stop reason = EXC_BAD_ACCESS (code=2, address=0x100000002)
* frame #0: 0x0000000102696a26 qdrant`std::__1::char_traits<char>::assign(__c1=0x0000000100000002, __c2=0x000070000a597216) at char_traits.h:198:73
frame #1: 0x0000000102a41528 qdrant`std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::clear[abi:v15006](this=Summary Unavailable) at string:3335:9
frame #2: 0x0000000102b8fc29 qdrant`rocksdb::Configurable::AreEquivalent(this=0x00000001044a8d68, config_options=0x0000000000000002, other=0x000070000a5971f8, name=Summary Unavailable) const at configurable.cc:611:9
frame #3: 0x00000001027dacb0 qdrant`rocksdb::DBImpl::MaybeScheduleFlushOrCompaction(this=0x000000010e86d000) at db_impl_compaction_flush.cc:2578:13
frame #4: 0x00000001027ceef2 qdrant`rocksdb::DBImpl::InstallSuperVersionAndScheduleWork(this=0x000000010e86d000, cfd=0x000000010e8b4e00, sv_context=0x000070000a5983a8, mutable_cf_options=0x000070000a597848) at db_impl_compaction_flush.cc:3863:3
The interesting thing happens on the boundary between frames #2 and #3. Frame #3 points to the GetBackgroundThreads() call:
frame #3: 0x00000001027dacb0 qdrant`rocksdb::DBImpl::MaybeScheduleFlushOrCompaction(this=0x000000010e86d000) at db_impl_compaction_flush.cc:2578:13
2575 }
2576 auto bg_job_limits = GetBGJobLimits();
2577 bool is_flush_pool_empty =
-> 2578 env_->GetBackgroundThreads(Env::Priority::HIGH) == 0;
2579 while (!is_flush_pool_empty && unscheduled_flushes_ > 0 &&
2580 bg_flush_scheduled_ < bg_job_limits.max_flushes) {
2581 bg_flush_scheduled_++;
(lldb) frame variable
(rocksdb::DBImpl *) this = 0x000000010e86d000
(rocksdb::DBImpl::BGJobLimits) bg_job_limits = (max_flushes = 1, max_compactions = 1)
(bool) is_flush_pool_empty = false
Then the call flow lands in AreEquivalent(), where the second and third arguments are invalid pointers:
frame #2: 0x0000000102b8fc29 qdrant`rocksdb::Configurable::AreEquivalent(this=0x00000001044a8d68, config_options=0x0000000000000002, other=0x000070000a5971f8, name=Summary Unavailable) const at configurable.cc:611:9
608 const Configurable* other,
609 std::string* name) const {
610 assert(name);
-> 611 name->clear();
612 if (this == other || config_options.IsCheckDisabled()) {
613 return true;
614 } else if (other != nullptr) {
(lldb) frame variable
(rocksdb::Configurable *) this = 0x00000001044a8d68
(const rocksdb::ConfigOptions &) config_options = 0x0000000000000002: {
ignore_unknown_options = <read memory from 0x2 failed (0 of 1 bytes read)>
ignore_unsupported_options = <read memory from 0x3 failed (0 of 1 bytes read)>
input_strings_escaped = <read memory from 0x4 failed (0 of 1 bytes read)>
invoke_prepare_options = <read memory from 0x5 failed (0 of 1 bytes read)>
mutable_options_only = <read memory from 0x6 failed (0 of 1 bytes read)>
delimiter = ""
depth = <read memory from 0x22 failed (0 of 4 bytes read)>
sanity_level = <read memory from 0x26 failed (0 of 1 bytes read)>
file_readahead_size = <read memory from 0x2a failed (0 of 8 bytes read)>
env = <read memory from 0x32 failed (0 of 8 bytes read)>
registry = <read memory from 0x3a failed (0 of 16 bytes read)>
}
(const rocksdb::Configurable *) other = 0x000070000a5971f8
(std::string *) name = Summary Unavailable
Looking through the RocksDB source, I've not been able to find from where AreEquivalent() is called. Still digging into this.
Very happy you were able to reproduce the issue :clap:
I was starting to think it was actually related to something on my local system.
According to my core dumps and your investigation, it could be related to compaction.
Maybe you could try to somehow disable compaction in our RocksDB setup to see if the bug disappears :thinking:
We have the following configuration in rocksdb_wrapper.rs:
pub fn db_options() -> Options {
    let mut options: Options = Options::default();
    options.set_write_buffer_size(DB_CACHE_SIZE);
    options.create_if_missing(true);
    options.set_log_level(LogLevel::Error);
    options.set_recycle_log_file_num(2);
    options.set_max_log_file_size(DB_MAX_LOG_SIZE);
    options.create_missing_column_families(true);
    options.set_max_open_files(DB_MAX_OPEN_FILES as i32);
    #[cfg(debug_assertions)]
    {
        options.set_paranoid_checks(true);
    }
    options
}
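For example, a minimal sketch of an options variant with auto-compactions turned off (the function name is hypothetical; set_disable_auto_compactions() is the rust-rocksdb Options setter for this):

use rocksdb::Options;

// Experimental variant of db_options(): same spirit, but with RocksDB
// background auto-compactions turned off, to check whether the crash is
// tied to compaction.
pub fn db_options_no_auto_compaction() -> Options {
    let mut options = Options::default();
    options.create_if_missing(true);
    // Disable automatic background compactions for this experiment only.
    options.set_disable_auto_compactions(true);
    options
}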
Ok, I think I know what's going on.
TL;DR: there is a race condition between the main thread's shutdown and the threads on which RocksDB flush/compaction is executed.
Detailed description
RocksDB uses several static variables while the compaction job is running (cfd_type_info and cs_input_type_info). When main() returns (e.g. after pressing CTRL+C), the libc++ library executes the __cxa_finalize_ranges() function, which cleans up global static variables.
The issue happens when the compaction job is still running at this moment, because it accesses static variables that are no longer valid. Every SIGSEGV instance I've faced had two threads interacting in the following way:
Process 0 stopped
thread #1: tid = 0x0000, 0x0000000109e2ffb0 qdrant`__cxx_global_array_dtor.14
* thread #9: tid = 0x0008, 0x00007ff8039821f2 libsystem_kernel.dylib`__pthread_kill + 10
Here thread 1 is the main thread running the cleanup of static variables, and thread 9 is the flushing/compacting job hitting an invalid pointer:
* thread #1
* frame #0: 0x0000000109e2ffb0 qdrant`__cxx_global_array_dtor.14
frame #1: 0x00007ff80388bba1 libsystem_c.dylib`__cxa_finalize_ranges + 409
frame #2: 0x00007ff80388b9bb libsystem_c.dylib`exit + 35
frame #3: 0x00007ff8039d18d3 libdyld.dylib`dyld4::LibSystemHelpers::exit(int) const + 11
frame #4: 0x00007ff803660458 dyld`start + 1960
* thread #9
* frame #0: 0x00007ff8039821f2 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x00007ff8039b9ee6 libsystem_pthread.dylib`pthread_kill + 263
frame #2: 0x00007ff8038e0b45 libsystem_c.dylib`abort + 123
frame #3: 0x00007ff8037f7752 libsystem_malloc.dylib`malloc_vreport + 888
frame #4: 0x00007ff8037fab31 libsystem_malloc.dylib`malloc_report + 151
frame #5: 0x0000000109c8babc qdrant`rocksdb::DBImpl::MaybeScheduleFlushOrCompaction() + 300
frame #6: 0x0000000109c80e58 qdrant`rocksdb::DBImpl::InstallSuperVersionAndScheduleWork(rocksdb::ColumnFamilyData*, rocksdb::SuperVersionContext*, rocksdb::MutableCFOptions const&) + 408
frame #7: 0x0000000109cd1c00 qdrant`rocksdb::DBImpl::SwitchMemtable(rocksdb::ColumnFamilyData*, rocksdb::DBImpl::WriteContext*) + 4144
frame #8: 0x0000000109c8890f qdrant`rocksdb::DBImpl::FlushMemTable(rocksdb::ColumnFamilyData*, rocksdb::FlushOptions const&, rocksdb::FlushReason, bool) + 1423
frame #9: 0x0000000109c8ea5a qdrant`rocksdb::DBImpl::Flush(rocksdb::FlushOptions const&, rocksdb::ColumnFamilyHandle*) + 474
frame #10: 0x0000000109be74ec qdrant`rocksdb_flush_cf + 44
frame #11: 0x000000010965322a qdrant`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h73defdeae483b387 + 378
frame #12: 0x00000001095aaafa qdrant`_$LT$segment..segment..Segment$u20$as$u20$segment..entry..entry_point..SegmentEntry$GT$::flush::h971de62d706a7c04 + 4378
frame #13: 0x0000000108a842ce qdrant`collection::collection_manager::holders::segment_holder::SegmentHolder::flush_all::h488f56d193020fc4 + 1086
frame #14: 0x00000001089232c8 qdrant`collection::update_handler::UpdateHandler::flush_worker::_$u7b$$u7b$closure$u7d$$u7d$::hc834668c932ed5d7 + 2008
frame #15: 0x0000000108930de9 qdrant`tokio::runtime::task::raw::poll::h23ab895f9b752507 + 361
frame #16: 0x0000000109ad1446 qdrant`tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h715014996561875a + 326
frame #17: 0x0000000109ac8915 qdrant`tokio::runtime::task::raw::poll::h0d484c9e4b4b52c3 + 1877
frame #18: 0x0000000109ab8550 qdrant`std::sys_common::backtrace::__rust_begin_short_backtrace::h3258c258d34e8da8 + 368
frame #19: 0x0000000109ada521 qdrant`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h2ff60a62170c522a + 321
frame #20: 0x00000001096ffde9 qdrant`std::sys::unix::thread::Thread::new::thread_start::hd752b82ee9981a13 + 41
frame #21: 0x00007ff8039ba1d3 libsystem_pthread.dylib`_pthread_start + 125
frame #22: 0x00007ff8039b5bd3 libsystem_pthread.dylib`thread_start + 15
@igorcanadi mentioned in https://github.com/facebook/rocksdb/issues/649#issuecomment-119779477 that RocksDB should be fully closed before main() returns:
> Yes, please delete RocksDB before the static destruction kicks in. RocksDB depends on statically allocated objects.
What can be done?
IMO main() should synchronize and wait for all the background activities to finish. To the best of my current knowledge, there are two ways flush is executed now:
- ad hoc threads spawned from Segment::flush() (https://github.com/qdrant/qdrant/blob/b13df2272cb351986d2a47844127a5280dfbc3cb/lib/segment/src/segment.rs#L1216-L1226)
- the background task in UpdateHandler executing in the async runtime
I've tried to make minimal changes to the source code to prove that the issue can be fixed this way. In the main() function I've tried to gracefully shut down the 3 runtimes (general, search, and update) by calling Runtime::shutdown_timeout(). Unfortunately, I see no easy way to do so (maybe someone can nudge me in the right direction), because owned references to the runtimes need to be obtained first.
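For reference, a minimal sketch (plain tokio, not qdrant code) of why owned references are needed: Runtime::shutdown_timeout() consumes the runtime by value, so it cannot be called through an Arc<Runtime> or a borrowed reference:

use std::time::Duration;
use tokio::runtime::Runtime;

fn main() {
    // Building the runtime here gives main() ownership of it.
    let update_runtime = Runtime::new().expect("failed to build runtime");

    update_runtime.spawn(async {
        // ... background flush work would run here ...
    });

    // shutdown_timeout(self, ...) takes the runtime by value and waits up
    // to the given duration for spawned tasks to finish. This compiles
    // only because main() still owns `update_runtime`.
    update_runtime.shutdown_timeout(Duration::from_secs(5));
}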
Great investigation :+1:
I can see that we are calling lock_flushing to join the background flushing thread when dropping a Segment.
Regarding the UpdateHandler, it seems they are stopped at the end of a long chain of before_drop calls when the TableOfContent is being dropped.
To make sure I understand, why are those drop calls not enough in the case of a graceful shutdown?
Regarding your current approach, are you trying to set a handler on the ctrlc signal to trigger the various shutdowns manually instead?
> I can see that we are calling lock_flushing to join the background flushing thread when dropping a Segment.
I'm not sure how exactly this happens, but I saw using lldb that the background_flush thread is the one getting SIGSEGV. One possible scenario is that lock_flushing is overwritten while it already holds a handle (is Some). In that case the already in-flight thread handle is dropped and the thread becomes detached. As far as I was able to test, forcing sync = true in Segment::flush() fixes it (well, not really fixes it, but it confirms that the hypothesis is correct).
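To illustrate the suspected mechanism, a minimal sketch (plain std threads, hypothetical names, not qdrant code): assigning over an Option<JoinHandle<_>> that is already Some drops the old handle, which silently detaches the still-running thread:

use std::thread;
use std::time::Duration;

fn main() {
    let mut flush_handle: Option<thread::JoinHandle<()>> = None;

    for i in 0..2 {
        // If the previous flush is still running, this assignment drops
        // its JoinHandle and that thread becomes detached: nobody can
        // join it any more.
        flush_handle = Some(thread::spawn(move || {
            thread::sleep(Duration::from_millis(50));
            println!("flush {i} done");
        }));
    }

    // Only the last thread can be joined here; the first one may outlive
    // main(), which is exactly the shutdown hazard described above.
    if let Some(handle) = flush_handle.take() {
        handle.join().unwrap();
    }
}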
> To make sure I understand, why are those drop calls not enough in the case of a graceful shutdown?
> Regarding your current approach, are you trying to set a handler on the ctrlc signal to trigger the various shutdowns manually instead?
This part I'm still trying to figure out. When Segment::flush() is "fixed", all SIGSEGVs come from tokio threads running the flush in the context of async tasks. Right now all the runtimes are passed to TableOfContent::new(). What I'm trying to do now is to pass a Handle instead of a Runtime to the TableOfContent, and keep all the runtimes in main() so they can be shut down properly at the end.
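Roughly, a sketch of that split (hypothetical names, not the actual qdrant API): the TableOfContent receives only a Handle for spawning tasks, while main() keeps the owned Runtime and can therefore shut it down:

use std::time::Duration;
use tokio::runtime::{Handle, Runtime};

// Hypothetical stand-in for qdrant's TableOfContent: it stores only a
// Handle, which is cheaply cloneable and does not own the runtime.
struct Toc {
    update_handle: Handle,
}

impl Toc {
    fn new(update_handle: Handle) -> Self {
        Self { update_handle }
    }

    fn spawn_flush(&self) {
        // Tasks are spawned through the Handle just as through a Runtime.
        self.update_handle.spawn(async {
            // ... flush work ...
        });
    }
}

fn main() {
    let update_runtime = Runtime::new().expect("failed to build runtime");
    let toc = Toc::new(update_runtime.handle().clone());

    toc.spawn_flush();

    // Because main() kept ownership of the runtime, it can wait for
    // background tasks to finish before any teardown kicks in.
    update_runtime.shutdown_timeout(Duration::from_secs(5));
}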
> Regarding the UpdateHandler, it seems they are stopped at the end of a long chain of before_drop calls when the TableOfContent is being dropped.
Not quite. TableOfContent does not own the Collections; those are Arc'ed. So when the ToC is dropped, all the underlying objects are dropped only if the ToC was the last holder.
Ok, it seems to be working. I can force main() to wait and become the last holder of the ToC using something like this:
let mut toc_arc = toc_arc;
loop {
    match Arc::try_unwrap(toc_arc) {
        Ok(toc) => {
            drop(toc);
            break;
        }
        Err(toc) => {
            toc_arc = toc;
            log::warn!("Waiting for ToC");
            thread::sleep_ms(1000);
        }
    }
}
I abused the 3-node cluster as hard as I could, but no SIGSEGVs happened.
Nice progress :+1:
Were you somehow able to track which toc references are still alive during your wait loop?
Feel free to create a draft PR to share your results with everyone; it does not have to be perfect right away :)
> Were you somehow able to track which toc references are still alive during your wait loop?
Yes, Arc::try_unwrap() returns Ok only if the Arc has a single reference left, and that one is held by main().
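A minimal sketch of that behavior (plain std, no qdrant types):

use std::sync::Arc;

fn main() {
    let toc = Arc::new(String::from("ToC"));
    let other_holder = Arc::clone(&toc);

    // Two strong references exist, so unwrapping fails and hands the
    // Arc back through Err.
    let toc = Arc::try_unwrap(toc).unwrap_err();

    drop(other_holder);

    // Now this is the only reference left, so unwrapping succeeds.
    let inner = Arc::try_unwrap(toc).unwrap();
    assert_eq!(inner, "ToC");
}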
> Feel free to create a draft PR to share your results with everyone; it does not have to be perfect right away :)
Sure, I will.