SSD-tier: full sync replication crashes or hangs the master
Describe the bug Starting a full sync replication from an SSD-tier master node (with no existing replica) crashes or hangs the master.
To Reproduce Steps to reproduce the behavior:
- run a single master node (no replica) with SSD tiering enabled.
- write 20,000,000 to 30,000,000 string key-value pairs to the master (key length 32, value about 2KB); this consumes about 36G of SSD storage.
- start another node (same config as master).
- on the new node, run replicaof (master ip:port).
- after a short time, the master crashes or hangs.
- if it hangs, any operation on the replica returns:
(error) LOADING Dragonfly is loading the dataset in memory
- if it crashes, the master's stderr shows:
*** SIGFPE received at time=1745151458 on cpu 2 ***
PC: @ 0x5b216091c605 (unknown) mi_free_generic_mt
or
*** SIGSEGV received at time=1745156745 on cpu 1 ***
PC: @ 0x569e90c3df04 (unknown) util::fb2::EventCount::await<>()
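The report does not say which tool was used to load the data. As a minimal sketch of the write step, assuming plain pipelined `SET` commands over the Redis protocol are acceptable, the following uses only the Python standard library. The helper names (`encode_command`, `make_key`, `populate`), the batch size, and the constant value payload are illustrative assumptions, not part of the original report.

```python
import socket

KEY_LEN = 32      # key length from the report
VALUE_LEN = 2048  # "about 2KB" value from the report


def encode_command(*args: bytes) -> bytes:
    """Encode one command as a RESP array of bulk strings."""
    parts = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        parts.append(f"${len(arg)}\r\n".encode() + arg + b"\r\n")
    return b"".join(parts)


def make_key(i: int) -> bytes:
    """Build a unique, fixed-width key of KEY_LEN bytes (hypothetical naming scheme)."""
    return f"k{i:0{KEY_LEN - 1}d}".encode()


def populate(host: str, port: int, count: int, batch: int = 1000) -> None:
    """Write `count` string keys using pipelined SET commands."""
    value = b"x" * VALUE_LEN
    with socket.create_connection((host, port)) as sock:
        reader = sock.makefile("rb")
        for start in range(0, count, batch):
            stop = min(start + batch, count)
            payload = b"".join(
                encode_command(b"SET", make_key(i), value)
                for i in range(start, stop)
            )
            sock.sendall(payload)
            # Each SET yields a single-line reply; stop on the first error.
            for _ in range(stop - start):
                reply = reader.readline()
                if not reply.startswith(b"+OK"):
                    raise RuntimeError(f"unexpected reply: {reply!r}")
```

For example, `populate("127.0.0.1", 6379, 20_000_000)` would write the lower bound of the range above; the host and port here are placeholders for the master's address.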
Expected behavior Full sync completes normally, then the replica transitions to partial replication and keeps working.
Environment (please complete the following information):
- OS: [ubuntu 24.04]
- hang Kernel:
Linux ubuntu2404152192 6.8.0-57-generic #59-Ubuntu SMP PREEMPT_DYNAMIC Sat Mar 15 17:40:59 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
- crash Kernel:
6.6.72+ #1 SMP PREEMPT_DYNAMIC Sun Mar 30 09:01:26 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
- Containerized?: [hang tested in QEMU on IDC bare metal, crash tested in a Google Cloud VM]
- Dragonfly Version: [1.28.2]
@zhyhang Thanks for reporting the issue.
I reproduced it locally with the master built in either debug or opt mode. It reproduces easily.
mkdir /tmp/logmaster /tmp/logslave
Master
config:
# Dragonfly Configuration File
--port=6380
--maxmemory=10G
# Enable cache mode (evict keys when near maxmemory)
--cache_mode=true
--proactor_threads=4
# Directory to store log files
--log_dir=/tmp/logmaster
--dbfilename=
--tiered_prefix=/mnt/vol/tiered_master
--tiered_offload_threshold=0.2
--cluster_mode=emulated
and then debug POPULATE 5000000 key 4096
Slave
slave config:
--port=6381
--maxmemory=10G
--cache_mode=true
--proactor_threads=4
# Directory to store log files
--log_dir=/tmp/logslave
--dbfilename=
--tiered_prefix=/mnt/vol/tiered_slave
--tiered_offload_threshold=0.2
--cluster_mode=emulated
Run it via: taskset -c 4-8 ./dragonfly --flagfile ~/slave.conf
and then slaveof localhost 6380
Once the CPU usage goes down, info replication on the slave shows it's still in full sync.
info replication on the master is stuck.
debug stacktrace dumps the following (pasted only interesting bits):
I20250620 08:59:20.315248 51328 scheduler.cc:487] ------------ Fiber Dispatched (suspended:523947ms) ------------
0xb2ed88e95c10 std::function<>::operator()()
0xb2ed88e9362c util::fb2::detail::FiberInterface::ExecuteOnFiberStack()::{lambda()#1}::operator()()
0xb2ed88e948b4 boost::context::detail::fiber_ontop<>()
0xb2ed88e941f8 boost::context::fiber::resume_with<>()
0xb2ed88e92e6c util::fb2::detail::FiberInterface::SwitchTo()
0xb2ed88e883b8 util::fb2::detail::Scheduler::Preempt()
0xb2ed88166eb8 util::fb2::detail::FiberInterface::Suspend()
0xb2ed881675a8 util::fb2::EventCount::wait()
0xb2ed887a202c util::fb2::EventCount::await<>()
0xb2ed887a17e8 util::fb2::SharedMutex::lock()
0xb2ed88a8d434 std::lock_guard<>::lock_guard()
0xb2ed88a8b58c dfly::journal::JournalSlice::UnregisterOnChange()
0xb2ed88a7af6c dfly::journal::Journal::UnregisterOnChange()
0xb2ed8836f6d0 dfly::SliceSnapshot::FinalizeJournalStream()
0xb2ed88340cac dfly::RdbSaver::Impl::StopSnapshotting()
I20250620 08:59:20.261518 51328 scheduler.cc:487] ------------ Fiber shard_handler_periodic1 (suspended:523892ms) ------------
0xb2ed88e95c10 std::function<>::operator()()
0xb2ed88e9362c util::fb2::detail::FiberInterface::ExecuteOnFiberStack()::{lambda()#1}::operator()()
0xb2ed88e948b4 boost::context::detail::fiber_ontop<>()
0xb2ed88e941f8 boost::context::fiber::resume_with<>()
0xb2ed88e92e6c util::fb2::detail::FiberInterface::SwitchTo()
0xb2ed88e883b8 util::fb2::detail::Scheduler::Preempt()
0xb2ed88166eb8 util::fb2::detail::FiberInterface::Suspend()
0xb2ed881675a8 util::fb2::EventCount::wait()
0xb2ed887a2150 util::fb2::EventCount::await<>()
0xb2ed887a18b8 util::fb2::SharedMutex::lock_shared()
0xb2ed8888f03c dfly::SharedLock<>::SharedLock()
0xb2ed88870548 dfly::DflyCmd::BreakStalledFlowsInShard()
0xb2ed8825cd08 dfly::Service::Init()::{lambda()#1}::operator()()
0xb2ed8827d948 std::__invoke_impl<>()
0xb2ed882795a0 std::__invoke_r<>()
I20250620 08:59:20.261500 51328 scheduler.cc:487] ------------ Fiber heartbeat_periodic1 (suspended:532681ms) ------------
0xb2ed88e95c10 std::function<>::operator()()
0xb2ed88e9362c util::fb2::detail::FiberInterface::ExecuteOnFiberStack()::{lambda()#1}::operator()()
0xb2ed88e948b4 boost::context::detail::fiber_ontop<>()
0xb2ed88e941f8 boost::context::fiber::resume_with<>()
0xb2ed88e92e6c util::fb2::detail::FiberInterface::SwitchTo()
0xb2ed88e883b8 util::fb2::detail::Scheduler::Preempt()
0xb2ed88166eb8 util::fb2::detail::FiberInterface::Suspend()
0xb2ed881675a8 util::fb2::EventCount::wait()
0xb2ed8821b81c util::fb2::EventCount::await<>()
0xb2ed88215c30 util::fb2::Future<>::Get()
0xb2ed88371914 dfly::SliceSnapshot::PushSerialized()
0xb2ed883720e0 dfly::SliceSnapshot::ThrottleIfNeeded()
0xb2ed88a8ab78 dfly::journal::JournalSlice::SetFlushMode()
0xb2ed88a7b11c dfly::journal::Journal::SetFlushMode()
0xb2ed882120bc dfly::journal::JournalFlushGuard::~JournalFlushGuard()
I20250620 08:59:20.244653 51329 scheduler.cc:487] ------------ Fiber Dispatched (suspended:523877ms) ------------
0xb2ed88e95c10 std::function<>::operator()()
0xb2ed88e9362c util::fb2::detail::FiberInterface::ExecuteOnFiberStack()::{lambda()#1}::operator()()
0xb2ed88e948b4 boost::context::detail::fiber_ontop<>()
0xb2ed88e941f8 boost::context::fiber::resume_with<>()
0xb2ed88e92e6c util::fb2::detail::FiberInterface::SwitchTo()
0xb2ed88e883b8 util::fb2::detail::Scheduler::Preempt()
0xb2ed88166eb8 util::fb2::detail::FiberInterface::Suspend()
0xb2ed881675a8 util::fb2::EventCount::wait()
0xb2ed887a202c util::fb2::EventCount::await<>()
0xb2ed887a17e8 util::fb2::SharedMutex::lock()
0xb2ed88a8d434 std::lock_guard<>::lock_guard()
0xb2ed88a8b58c dfly::journal::JournalSlice::UnregisterOnChange()
0xb2ed88a7af6c dfly::journal::Journal::UnregisterOnChange()
0xb2ed8836f6d0 dfly::SliceSnapshot::FinalizeJournalStream()
0xb2ed88340cac dfly::RdbSaver::Impl::StopSnapshotting()
I20250620 08:59:20.108386 51327 scheduler.cc:487] ------------ Fiber heartbeat_periodic0 (suspended:531536ms) ------------
0xb2ed88e95c10 std::function<>::operator()()
0xb2ed88e9362c util::fb2::detail::FiberInterface::ExecuteOnFiberStack()::{lambda()#1}::operator()()
0xb2ed88e948b4 boost::context::detail::fiber_ontop<>()
0xb2ed88e941f8 boost::context::fiber::resume_with<>()
0xb2ed88e92e6c util::fb2::detail::FiberInterface::SwitchTo()
0xb2ed88e883b8 util::fb2::detail::Scheduler::Preempt()
0xb2ed88166eb8 util::fb2::detail::FiberInterface::Suspend()
0xb2ed881675a8 util::fb2::EventCount::wait()
0xb2ed8821b81c util::fb2::EventCount::await<>()
0xb2ed88215c30 util::fb2::Future<>::Get()
0xb2ed88371914 dfly::SliceSnapshot::PushSerialized()
0xb2ed883720e0 dfly::SliceSnapshot::ThrottleIfNeeded()
0xb2ed88a8ab78 dfly::journal::JournalSlice::SetFlushMode()
0xb2ed88a7b11c dfly::journal::Journal::SetFlushMode()
0xb2ed882120bc dfly::journal::JournalFlushGuard::~JournalFlushGuard()
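Since the full dump is long, a small script can help group suspended fibers by what they are waiting on. The sketch below is only an illustration: it assumes the text format shown in the paste above (a `Fiber <name> (suspended:<N>ms)` header followed by `0x<addr> <symbol>` lines, innermost frame first), and the helper names `parse_fibers`, `blocked_on`, and `summarize` are made up for this example.

```python
import re
from collections import Counter

HEADER = re.compile(r"Fiber (\S+) \(suspended:(\d+)ms\)")
FRAME = re.compile(r"^(0x[0-9a-f]+)\s+(.+)$")


def parse_fibers(text):
    """Split a pasted dump into fibers, each with a name, suspend time and frames."""
    fibers = []
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.search(line)
        if header:
            fibers.append({
                "name": header.group(1),
                "suspended_ms": int(header.group(2)),
                "frames": [],
            })
            continue
        frame = FRAME.match(line)
        if frame and fibers:
            fibers[-1]["frames"].append(frame.group(2))
    return fibers


def blocked_on(frames):
    """Return (wait primitive, first dfly:: frame) for one fiber.

    The wait primitive is taken to be the frame that called
    EventCount::await<>(), i.e. the next line in an innermost-first listing.
    """
    primitive = None
    for i, symbol in enumerate(frames):
        if "EventCount::await" in symbol and i + 1 < len(frames):
            primitive = frames[i + 1]
            break
    caller = next((s for s in frames if s.startswith("dfly::")), None)
    return primitive, caller


def summarize(text):
    """Count fibers per (wait primitive, dfly caller) pair."""
    return Counter(blocked_on(f["frames"]) for f in parse_fibers(text))
```

Run over the excerpt above, this would group the five fibers into two waiting in `SharedMutex::lock()` via `JournalSlice::UnregisterOnChange()`, one waiting in `SharedMutex::lock_shared()` via `DflyCmd::BreakStalledFlowsInShard()`, and two waiting in `Future<>::Get()` via `SliceSnapshot::PushSerialized()`.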