Segmentation fault on shutdown during optimization
The qdrant process produces a segmentation fault when shutting down during index optimization.
It does not always happen; it is rather rare.
Current Behavior
./cargo_run_follower2.sh: line 13: 474694 Segmentation fault (core dumped) cargo run --bin qdrant -- --bootstrap 'http://localhost:6335' &> ../cluster-qdrant-test/logs/follower2.log
Logs of the node at the time:
[2023-01-05T10:34:43.896Z DEBUG qdrant::tonic] Stopping gRPC
[2023-01-05T10:34:43.896Z DEBUG hyper::server::shutdown] signal received, starting graceful shutdown
[2023-01-05T10:34:43.896Z INFO actix_server::server] SIGINT received; starting forced shutdown
[2023-01-05T10:34:43.896Z DEBUG qdrant::tonic] Stopping internal gRPC
[2023-01-05T10:34:43.896Z INFO actix_server::worker] Shutting down idle worker
[2023-01-05T10:34:43.896Z INFO actix_server::worker] Shutting down idle worker
[2023-01-05T10:34:43.896Z DEBUG hyper::server::shutdown] signal received, starting graceful shutdown
[2023-01-05T10:34:43.896Z DEBUG actix_server::accept] Paused accepting connections on 127.0.0.1:6353
[2023-01-05T10:34:43.896Z INFO actix_server::accept] Accept thread stopped
[2023-01-05T10:34:43.896Z INFO actix_server::worker] Shutting down idle worker
[2023-01-05T10:34:43.897Z INFO actix_server::worker] Shutting down idle worker
[2023-01-05T10:34:43.899Z INFO actix_server::worker] Shutting down idle worker
[2023-01-05T10:34:43.899Z DEBUG hyper::server::server::new_svc] connection error: connection error: not connected
[2023-01-05T10:34:43.900Z DEBUG hyper::server::server::new_svc] connection error: connection error: not connected
[2023-01-05T10:34:43.900Z DEBUG hyper::server::server::new_svc] connection error: connection error: not connected
[2023-01-05T10:34:43.900Z INFO actix_server::worker] Shutting down idle worker
[2023-01-05T10:34:43.902Z INFO actix_server::worker] Shutting down idle worker
[2023-01-05T10:34:44.151Z DEBUG segment::index::hnsw_index::hnsw] finish main graph
[2023-01-05T10:34:44.180Z DEBUG segment::index::hnsw_index::hnsw] finish additional payload field indexing
I was able to analyze the core dump with gdb; the full log with all stack traces is in core-dump-gdb.log
The main information is:
Core was generated by `target/debug/qdrant --bootstrap http://localhost:6335'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055d096b2e34a in rocksdb::DBImpl::MaybeScheduleFlushOrCompaction (this=0x7f3adc0d5c80) at rocksdb/db/db_impl/db_impl_compaction_flush.cc:2452
2452 env_->GetBackgroundThreads(Env::Priority::HIGH) == 0;
Steps to Reproduce
Deploy a 3-node cluster and kill one node while it is optimizing an HNSW index.
Hi 👋
Looking through the provided gdb log, it seems the error happened in the RocksDB code, not in qdrant itself. I would like to reproduce this bug. Can you please provide more information about the environment where this happened?
Hi @bazhenov :wave:
Thank you for your interest in this issue!
I have not tried to reproduce this issue recently, but I used to be able to get it from time to time.
All the gdb logs I got pointed to RocksDB; here are the ones I still have:
I ran into this issue by running the following test on my laptop (Ubuntu):
- create a 3 node cluster
- create a single collection with 3 shards and 2 replicas
- ingest a few thousand points to trigger the indexing process
- manually restart nodes one at a time while the indexing & ingestion process is running
If you get lucky, one of those nodes will crash during shutdown.
Here is what I've found so far. I was able to reproduce the issue and catch it under lldb with a debug build of qdrant. The stack trace:
* thread #51, name = 'background_flush', stop reason = EXC_BAD_ACCESS (code=2, address=0x100000002)
* frame #0: 0x0000000102696a26 qdrant`std::__1::char_traits<char>::assign(__c1=0x0000000100000002, __c2=0x000070000a597216) at char_traits.h:198:73
frame #1: 0x0000000102a41528 qdrant`std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::clear[abi:v15006](this=Summary Unavailable) at string:3335:9
frame #2: 0x0000000102b8fc29 qdrant`rocksdb::Configurable::AreEquivalent(this=0x00000001044a8d68, config_options=0x0000000000000002, other=0x000070000a5971f8, name=Summary Unavailable) const at configurable.cc:611:9
frame #3: 0x00000001027dacb0 qdrant`rocksdb::DBImpl::MaybeScheduleFlushOrCompaction(this=0x000000010e86d000) at db_impl_compaction_flush.cc:2578:13
frame #4: 0x00000001027ceef2 qdrant`rocksdb::DBImpl::InstallSuperVersionAndScheduleWork(this=0x000000010e86d000, cfd=0x000000010e8b4e00, sv_context=0x000070000a5983a8, mutable_cf_options=0x000070000a597848) at db_impl_compaction_flush.cc:3863:3
The interesting thing happens on the boundary between frames #2 and #3. Frame #3 points to the GetBackgroundThreads() call:
frame #3: 0x00000001027dacb0 qdrant`rocksdb::DBImpl::MaybeScheduleFlushOrCompaction(this=0x000000010e86d000) at db_impl_compaction_flush.cc:2578:13
2575 }
2576 auto bg_job_limits = GetBGJobLimits();
2577 bool is_flush_pool_empty =
-> 2578 env_->GetBackgroundThreads(Env::Priority::HIGH) == 0;
2579 while (!is_flush_pool_empty && unscheduled_flushes_ > 0 &&
2580 bg_flush_scheduled_ < bg_job_limits.max_flushes) {
2581 bg_flush_scheduled_++;
(lldb) frame variable
(rocksdb::DBImpl *) this = 0x000000010e86d000
(rocksdb::DBImpl::BGJobLimits) bg_job_limits = (max_flushes = 1, max_compactions = 1)
(bool) is_flush_pool_empty = false
Then the call flow lands in AreEquivalent(), where the second and third arguments are invalid pointers:
frame #2: 0x0000000102b8fc29 qdrant`rocksdb::Configurable::AreEquivalent(this=0x00000001044a8d68, config_options=0x0000000000000002, other=0x000070000a5971f8, name=Summary Unavailable) const at configurable.cc:611:9
608 const Configurable* other,
609 std::string* name) const {
610 assert(name);
-> 611 name->clear();
612 if (this == other || config_options.IsCheckDisabled()) {
613 return true;
614 } else if (other != nullptr) {
(lldb) frame variable
(rocksdb::Configurable *) this = 0x00000001044a8d68
(const rocksdb::ConfigOptions &) config_options = 0x0000000000000002: {
ignore_unknown_options = <read memory from 0x2 failed (0 of 1 bytes read)>
ignore_unsupported_options = <read memory from 0x3 failed (0 of 1 bytes read)>
input_strings_escaped = <read memory from 0x4 failed (0 of 1 bytes read)>
invoke_prepare_options = <read memory from 0x5 failed (0 of 1 bytes read)>
mutable_options_only = <read memory from 0x6 failed (0 of 1 bytes read)>
delimiter = ""
depth = <read memory from 0x22 failed (0 of 4 bytes read)>
sanity_level = <read memory from 0x26 failed (0 of 1 bytes read)>
file_readahead_size = <read memory from 0x2a failed (0 of 8 bytes read)>
env = <read memory from 0x32 failed (0 of 8 bytes read)>
registry = <read memory from 0x3a failed (0 of 16 bytes read)>
}
(const rocksdb::Configurable *) other = 0x000070000a5971f8
(std::string *) name = Summary Unavailable
Looking through the RocksDB source, I've not been able to find from where AreEquivalent() is called. Still digging into this.
Very happy you were able to reproduce the issue :clap:
I was starting to think it was actually related to something on my local system.
According to my core dumps and your investigation, it could be related to compaction.
Maybe you could try to somehow disable compaction in our RocksDB setup to see if the bug disappears :thinking:
We have the following configuration in rocksdb_wrapper.rs:
pub fn db_options() -> Options {
    let mut options: Options = Options::default();
    options.set_write_buffer_size(DB_CACHE_SIZE);
    options.create_if_missing(true);
    options.set_log_level(LogLevel::Error);
    options.set_recycle_log_file_num(2);
    options.set_max_log_file_size(DB_MAX_LOG_SIZE);
    options.create_missing_column_families(true);
    options.set_max_open_files(DB_MAX_OPEN_FILES as i32);
    #[cfg(debug_assertions)]
    {
        options.set_paranoid_checks(true);
    }
    options
}
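For example, a minimal sketch of an options variant with auto-compactions turned off (the function name is hypothetical; set_disable_auto_compactions() is the rust-rocksdb Options setter for this):

use rocksdb::Options;

// Experimental variant of db_options(): same spirit, but with RocksDB
// background auto-compactions turned off, to check whether the crash is
// tied to compaction.
pub fn db_options_no_auto_compaction() -> Options {
    let mut options = Options::default();
    options.create_if_missing(true);
    // Disable automatic background compactions for this experiment only.
    options.set_disable_auto_compactions(true);
    options
}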
Ok, I think I know what's going on.
TL;DR: there is a race condition between the main thread's shutdown and the threads on which RocksDB flush/compaction is executed.
Detailed description
RocksDB uses several static variables while the compaction job is running (cfd_type_info and cs_input_type_info). When main() returns (e.g. after pressing CTRL+C), the libc++ library executes the __cxa_finalize_ranges() function, which cleans up global static variables.
The issue happens when the compaction job is still running at this moment, because it accesses static variables that are no longer valid. Every SIGSEGV instance I've faced had two threads interacting in the following way:
Process 0 stopped
thread #1: tid = 0x0000, 0x0000000109e2ffb0 qdrant`__cxx_global_array_dtor.14
* thread #9: tid = 0x0008, 0x00007ff8039821f2 libsystem_kernel.dylib`__pthread_kill + 10
Here thread 1 is the main thread running the cleanup of static variables, and thread 9 is the flushing/compacting job hitting an invalid pointer:
* thread #1
* frame #0: 0x0000000109e2ffb0 qdrant`__cxx_global_array_dtor.14
frame #1: 0x00007ff80388bba1 libsystem_c.dylib`__cxa_finalize_ranges + 409
frame #2: 0x00007ff80388b9bb libsystem_c.dylib`exit + 35
frame #3: 0x00007ff8039d18d3 libdyld.dylib`dyld4::LibSystemHelpers::exit(int) const + 11
frame #4: 0x00007ff803660458 dyld`start + 1960
* thread #9
* frame #0: 0x00007ff8039821f2 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x00007ff8039b9ee6 libsystem_pthread.dylib`pthread_kill + 263
frame #2: 0x00007ff8038e0b45 libsystem_c.dylib`abort + 123
frame #3: 0x00007ff8037f7752 libsystem_malloc.dylib`malloc_vreport + 888
frame #4: 0x00007ff8037fab31 libsystem_malloc.dylib`malloc_report + 151
frame #5: 0x0000000109c8babc qdrant`rocksdb::DBImpl::MaybeScheduleFlushOrCompaction() + 300
frame #6: 0x0000000109c80e58 qdrant`rocksdb::DBImpl::InstallSuperVersionAndScheduleWork(rocksdb::ColumnFamilyData*, rocksdb::SuperVersionContext*, rocksdb::MutableCFOptions const&) + 408
frame #7: 0x0000000109cd1c00 qdrant`rocksdb::DBImpl::SwitchMemtable(rocksdb::ColumnFamilyData*, rocksdb::DBImpl::WriteContext*) + 4144
frame #8: 0x0000000109c8890f qdrant`rocksdb::DBImpl::FlushMemTable(rocksdb::ColumnFamilyData*, rocksdb::FlushOptions const&, rocksdb::FlushReason, bool) + 1423
frame #9: 0x0000000109c8ea5a qdrant`rocksdb::DBImpl::Flush(rocksdb::FlushOptions const&, rocksdb::ColumnFamilyHandle*) + 474
frame #10: 0x0000000109be74ec qdrant`rocksdb_flush_cf + 44
frame #11: 0x000000010965322a qdrant`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h73defdeae483b387 + 378
frame #12: 0x00000001095aaafa qdrant`_$LT$segment..segment..Segment$u20$as$u20$segment..entry..entry_point..SegmentEntry$GT$::flush::h971de62d706a7c04 + 4378
frame #13: 0x0000000108a842ce qdrant`collection::collection_manager::holders::segment_holder::SegmentHolder::flush_all::h488f56d193020fc4 + 1086
frame #14: 0x00000001089232c8 qdrant`collection::update_handler::UpdateHandler::flush_worker::_$u7b$$u7b$closure$u7d$$u7d$::hc834668c932ed5d7 + 2008
frame #15: 0x0000000108930de9 qdrant`tokio::runtime::task::raw::poll::h23ab895f9b752507 + 361
frame #16: 0x0000000109ad1446 qdrant`tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h715014996561875a + 326
frame #17: 0x0000000109ac8915 qdrant`tokio::runtime::task::raw::poll::h0d484c9e4b4b52c3 + 1877
frame #18: 0x0000000109ab8550 qdrant`std::sys_common::backtrace::__rust_begin_short_backtrace::h3258c258d34e8da8 + 368
frame #19: 0x0000000109ada521 qdrant`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h2ff60a62170c522a + 321
frame #20: 0x00000001096ffde9 qdrant`std::sys::unix::thread::Thread::new::thread_start::hd752b82ee9981a13 + 41
frame #21: 0x00007ff8039ba1d3 libsystem_pthread.dylib`_pthread_start + 125
frame #22: 0x00007ff8039b5bd3 libsystem_pthread.dylib`thread_start + 15
@igorcanadi mentioned in https://github.com/facebook/rocksdb/issues/649#issuecomment-119779477 that RocksDB should be fully closed before main() returns:
> Yes, please delete RocksDB before the static destruction kicks in. RocksDB depends on statically allocated objects.
What can be done?
IMO main() should synchronize and wait for all the background activities to finish. To the best of my current knowledge, there are two ways flush is executed now:
- ad hoc threads spawned from Segment::flush() (https://github.com/qdrant/qdrant/blob/b13df2272cb351986d2a47844127a5280dfbc3cb/lib/segment/src/segment.rs#L1216-L1226)
- the background task in UpdateHandler executing in the async runtime
I've tried to make minimal changes to the source code to prove that the issue can be fixed this way. In the main() function I've tried to gracefully shut down the 3 runtimes (general, search, and update) by calling Runtime::shutdown_timeout(). Unfortunately, I see no easy way to do so (maybe someone can nudge me in the right direction), because owned references to the runtimes need to be obtained first.
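For reference, a minimal sketch (plain tokio, not qdrant code) of why owned references are needed: Runtime::shutdown_timeout() consumes the runtime by value, so it cannot be called through an Arc<Runtime> or a borrowed reference:

use std::time::Duration;
use tokio::runtime::Runtime;

fn main() {
    // Building the runtime here gives main() ownership of it.
    let update_runtime = Runtime::new().expect("failed to build runtime");

    update_runtime.spawn(async {
        // ... background flush work would run here ...
    });

    // shutdown_timeout(self, ...) takes the runtime by value and waits up
    // to the given duration for spawned tasks to finish. This compiles
    // only because main() still owns `update_runtime`.
    update_runtime.shutdown_timeout(Duration::from_secs(5));
}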
Great investigation :+1:
I can see that we are calling lock_flushing to join the background flushing thread when dropping a Segment.
Regarding the UpdateHandler, it seems they are stopped at the end of a long chain of before_drop calls when the TableOfContent is being dropped.
To make sure I understand, why are those drop calls not enough in the case of a graceful shutdown?
Regarding your current approach, are you trying to set a handler on the ctrlc signal to trigger the various shutdowns manually instead?
> I can see that we are calling lock_flushing to join the background flushing thread when dropping a Segment.
I'm not sure how exactly this happens, but I saw using lldb that the background_flush thread is the one getting SIGSEGV. One possible scenario is that lock_flushing is overwritten while it already holds a handle (is Some). In that case the already in-flight thread handle is dropped and the thread becomes detached. As far as I was able to test, forcing sync = true in Segment::flush() fixes it (well, not really fixes it, but it confirms that the hypothesis is correct).
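To illustrate the suspected mechanism, a minimal sketch (plain std threads, hypothetical names, not qdrant code): assigning over an Option<JoinHandle<_>> that is already Some drops the old handle, which silently detaches the still-running thread:

use std::thread;
use std::time::Duration;

fn main() {
    let mut flush_handle: Option<thread::JoinHandle<()>> = None;

    for i in 0..2 {
        // If the previous flush is still running, this assignment drops
        // its JoinHandle and that thread becomes detached: nobody can
        // join it any more.
        flush_handle = Some(thread::spawn(move || {
            thread::sleep(Duration::from_millis(50));
            println!("flush {i} done");
        }));
    }

    // Only the last thread can be joined here; the first one may outlive
    // main(), which is exactly the shutdown hazard described above.
    if let Some(handle) = flush_handle.take() {
        handle.join().unwrap();
    }
}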
> To make sure I understand, why are those drop calls not enough in the case of a graceful shutdown?
> Regarding your current approach, are you trying to set a handler on the ctrlc signal to trigger the various shutdowns manually instead?
This part I'm still trying to figure out. When Segment::flush() is "fixed", all SIGSEGVs come from tokio threads running the flush in the context of async tasks. Right now all the runtimes are passed to TableOfContent::new(). What I'm trying to do now is to pass a Handle instead of a Runtime to the TableOfContent, and keep all the runtimes in main() so they can be shut down properly at the end.
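Roughly, a sketch of that split (hypothetical names, not the actual qdrant API): the TableOfContent receives only a Handle for spawning tasks, while main() keeps the owned Runtime and can therefore shut it down:

use std::time::Duration;
use tokio::runtime::{Handle, Runtime};

// Hypothetical stand-in for qdrant's TableOfContent: it stores only a
// Handle, which is cheaply cloneable and does not own the runtime.
struct Toc {
    update_handle: Handle,
}

impl Toc {
    fn new(update_handle: Handle) -> Self {
        Self { update_handle }
    }

    fn spawn_flush(&self) {
        // Tasks are spawned through the Handle just as through a Runtime.
        self.update_handle.spawn(async {
            // ... flush work ...
        });
    }
}

fn main() {
    let update_runtime = Runtime::new().expect("failed to build runtime");
    let toc = Toc::new(update_runtime.handle().clone());

    toc.spawn_flush();

    // Because main() kept ownership of the runtime, it can wait for
    // background tasks to finish before any teardown kicks in.
    update_runtime.shutdown_timeout(Duration::from_secs(5));
}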
> Regarding the UpdateHandler, it seems they are stopped at the end of a long chain of before_drop calls when the TableOfContent is being dropped.
Not quite. TableOfContent does not own the Collections; those are Arc'ed. So when the ToC is dropped, all the underlying objects are dropped only if the ToC was the last holder.
Ok, it seems to be working. I can force main() to wait and become the last holder of the ToC using something like this:
let mut toc_arc = toc_arc;
loop {
    match Arc::try_unwrap(toc_arc) {
        Ok(toc) => {
            drop(toc);
            break;
        }
        Err(toc) => {
            toc_arc = toc;
            log::warn!("Waiting for ToC");
            thread::sleep_ms(1000);
        }
    }
}
I abused the 3-node cluster as hard as I could, but no SIGSEGVs happened.
Nice progress :+1:
Were you somehow able to track which toc references are still alive during your wait loop?
Feel free to create a draft PR to share your results with everyone; it does not have to be perfect right away :)
> Were you somehow able to track which toc references are still alive during your wait loop?
Yes, Arc::try_unwrap() returns Ok only if the Arc has a single reference left, and that one is held by main().
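A minimal sketch of that behavior (plain std, no qdrant types):

use std::sync::Arc;

fn main() {
    let toc = Arc::new(String::from("ToC"));
    let other_holder = Arc::clone(&toc);

    // Two strong references exist, so unwrapping fails and hands the
    // Arc back through Err.
    let toc = Arc::try_unwrap(toc).unwrap_err();

    drop(other_holder);

    // Now this is the only reference left, so unwrapping succeeds.
    let inner = Arc::try_unwrap(toc).unwrap();
    assert_eq!(inner, "ToC");
}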
> Feel free to create a draft PR to share your results with everyone; it does not have to be perfect right away :)
Sure, I will.