NuRaft
NuRaft copied to clipboard
Concurrent `save_logical_snp_obj` and `create_snapshot`.
Hi! Seems like it's possible that NuRaft will concurrently run both methods (gdb trace https://gist.github.com/alesapin/973313f4d89f76634b4b1dd7e653e8ff). My code was node ready for this and I've got a deadlock :)
Is it expected behavior? Of course, I'll fix my code, but maybe it would be simpler to manage this on the NuRaft side? Both create
and save
snapshot methods write data to disk. In my case, it's not a problem, but I can imagine cases when it possibly can be painful.
Hi @alesapin
NuRaft itself doesn't call save_logical_snp_obj
and create_snapshot
at the same time. save_logical_snp_obj
will be invoked in snapshot transmission mode only, while create_snapshot
will be called in normal mode only. The safety of calling them together depends on the implementation of snapshot and state machine, which is irrelevant to NuRaft.
In the call stack you shared, seems to me create_snapshot
is not triggered by NuRaft, is that right?
#16 DB::KeeperStateMachine::create_snapshot(nuraft::snapshot&, std::__1::function<void (bool&, std::__1::shared_ptr<std::exception>&)>&)::$_0::operator()(std::__1::shared_ptr<DB::KeeperStorageSnapshot>&&) const (this=0x7b0c00033930, snapshot=...) at /home/alesap/code/cpp/ClickHouse/src/Coordination/KeeperStateMachine.cpp:199
#23 DB::KeeperStorageDispatcher::snapshotThread (this=<optimized out>, this@entry=0x7b6000004c18) at /home/alesap/code/cpp/ClickHouse/src/Coordination/KeeperStorageDispatcher.cpp:85
#24 0x00000000139cc271 in DB::KeeperStorageDispatcher::initialize(Poco::Util::AbstractConfiguration const&)::$_2::operator()() const (this=<optimized out>) at /home/alesap/code/cpp/ClickHouse/src/Coordination/KeeperStorageDispatcher.cpp:137
Hi @alesapin
NuRaft itself doesn't call
save_logical_snp_obj
andcreate_snapshot
at the same time.save_logical_snp_obj
will be invoked in snapshot transmission mode only, whilecreate_snapshot
will be called in normal mode only. The safety of calling them together depends on the implementation of snapshot and state machine, which is irrelevant to NuRaft.In the call stack you shared, seems to me
create_snapshot
is not triggered by NuRaft, is that right?#16 DB::KeeperStateMachine::create_snapshot(nuraft::snapshot&, std::__1::function<void (bool&, std::__1::shared_ptr<std::exception>&)>&)::$_0::operator()(std::__1::shared_ptr<DB::KeeperStorageSnapshot>&&) const (this=0x7b0c00033930, snapshot=...) at /home/alesap/code/cpp/ClickHouse/src/Coordination/KeeperStateMachine.cpp:199 #23 DB::KeeperStorageDispatcher::snapshotThread (this=<optimized out>, this@entry=0x7b6000004c18) at /home/alesap/code/cpp/ClickHouse/src/Coordination/KeeperStorageDispatcher.cpp:85 #24 0x00000000139cc271 in DB::KeeperStorageDispatcher::initialize(Poco::Util::AbstractConfiguration const&)::$_2::operator()() const (this=<optimized out>) at /home/alesap/code/cpp/ClickHouse/src/Coordination/KeeperStorageDispatcher.cpp:137
Really, my initial post was not precise. NuRaft can call save_logical_snp_obj
before user code will call when_done
after async create_snapshot
. So in my code, I create lambda in create_snapshot
and execute in a different thread. But this lambda was not finished (when_done
was not called) yet and save_logical_snp_obj
was already in progress.
Ah, now I get your point. It will be somewhat tricky to handle the situation if create_snapshot
takes a long time (or even stuck forever) and meanwhile, things changed so that snapshot transfer begins. Need to have time to think about it.