jgroups-raft icon indicating copy to clipboard operation
jgroups-raft copied to clipboard

Taking snapshot and committing log are not atomic operations for each other

Open yfei-z opened this issue 1 year ago • 1 comments

Snapshot requires the latest state of the state machine and the commit index, if taking the snapshot in a separated thread, although RAFT.snapshot() is synchronized but RAFT.commitLogTo(long, boolean) is not, if snapshot has the latest state but commit index is not updated, then wrong commit index will be saved in log as the first appended, and duplicated logs will be applied when state machine initial from the snapshot. I think making RAFT.commitLogTo(long, boolean) synchronized could solve the problem and don't need to make the methods of state machine to be thread safe.

yfei-z avatar Aug 20 '24 07:08 yfei-z

I've noticed the snapshot() is exposed in the RaftHandle public API. Are you invoking it from there? Otherwise, the snapshot method is only ever invoked from a single thread, the same as the commitLogTo method. They are in the handling loop which runs on a single thread.

Also, I am handwaving that snapshot() is also exposed as a JMX operation. The concurrency issue here would happen by invoking snapshot() directly through the RaftHandle API while the algorithm handles requests. We could submit a snapshot() operation to the internal queue to synchronize everything.

jabolina avatar Aug 21 '24 13:08 jabolina