NuRaft icon indicating copy to clipboard operation
NuRaft copied to clipboard

Concurrent `save_logical_snp_obj` and `create_snapshot`.

Open alesapin opened this issue 3 years ago • 3 comments

Hi! Seems like it's possible that NuRaft will concurrently run both methods (gdb trace https://gist.github.com/alesapin/973313f4d89f76634b4b1dd7e653e8ff). My code was node ready for this and I've got a deadlock :)

Is it expected behavior? Of course, I'll fix my code, but maybe it would be simpler to manage this on the NuRaft side? Both create and save snapshot methods write data to disk. In my case, it's not a problem, but I can imagine cases when it possibly can be painful.

alesapin avatar Apr 01 '21 11:04 alesapin

Hi @alesapin

NuRaft itself doesn't call save_logical_snp_obj and create_snapshot at the same time. save_logical_snp_obj will be invoked in snapshot transmission mode only, while create_snapshot will be called in normal mode only. The safety of calling them together depends on the implementation of snapshot and state machine, which is irrelevant to NuRaft.

In the call stack you shared, seems to me create_snapshot is not triggered by NuRaft, is that right?

#16 DB::KeeperStateMachine::create_snapshot(nuraft::snapshot&, std::__1::function<void (bool&, std::__1::shared_ptr<std::exception>&)>&)::$_0::operator()(std::__1::shared_ptr<DB::KeeperStorageSnapshot>&&) const (this=0x7b0c00033930, snapshot=...) at /home/alesap/code/cpp/ClickHouse/src/Coordination/KeeperStateMachine.cpp:199
#23 DB::KeeperStorageDispatcher::snapshotThread (this=<optimized out>, this@entry=0x7b6000004c18) at /home/alesap/code/cpp/ClickHouse/src/Coordination/KeeperStorageDispatcher.cpp:85
#24 0x00000000139cc271 in DB::KeeperStorageDispatcher::initialize(Poco::Util::AbstractConfiguration const&)::$_2::operator()() const (this=<optimized out>) at /home/alesap/code/cpp/ClickHouse/src/Coordination/KeeperStorageDispatcher.cpp:137

greensky00 avatar Apr 02 '21 05:04 greensky00

Hi @alesapin

NuRaft itself doesn't call save_logical_snp_obj and create_snapshot at the same time. save_logical_snp_obj will be invoked in snapshot transmission mode only, while create_snapshot will be called in normal mode only. The safety of calling them together depends on the implementation of snapshot and state machine, which is irrelevant to NuRaft.

In the call stack you shared, seems to me create_snapshot is not triggered by NuRaft, is that right?

#16 DB::KeeperStateMachine::create_snapshot(nuraft::snapshot&, std::__1::function<void (bool&, std::__1::shared_ptr<std::exception>&)>&)::$_0::operator()(std::__1::shared_ptr<DB::KeeperStorageSnapshot>&&) const (this=0x7b0c00033930, snapshot=...) at /home/alesap/code/cpp/ClickHouse/src/Coordination/KeeperStateMachine.cpp:199
#23 DB::KeeperStorageDispatcher::snapshotThread (this=<optimized out>, this@entry=0x7b6000004c18) at /home/alesap/code/cpp/ClickHouse/src/Coordination/KeeperStorageDispatcher.cpp:85
#24 0x00000000139cc271 in DB::KeeperStorageDispatcher::initialize(Poco::Util::AbstractConfiguration const&)::$_2::operator()() const (this=<optimized out>) at /home/alesap/code/cpp/ClickHouse/src/Coordination/KeeperStorageDispatcher.cpp:137

Really, my initial post was not precise. NuRaft can call save_logical_snp_obj before user code will call when_done after async create_snapshot. So in my code, I create lambda in create_snapshot and execute in a different thread. But this lambda was not finished (when_done was not called) yet and save_logical_snp_obj was already in progress.

alesapin avatar Apr 02 '21 07:04 alesapin

Ah, now I get your point. It will be somewhat tricky to handle the situation if create_snapshot takes a long time (or even stuck forever) and meanwhile, things changed so that snapshot transfer begins. Need to have time to think about it.

greensky00 avatar Apr 02 '21 16:04 greensky00