zookeeper icon indicating copy to clipboard operation
zookeeper copied to clipboard

ZOOKEEPER-4858: Remove the lock contention between snapshotting and the sync operation

Open li4wang opened this issue 1 year ago • 3 comments

Author: Li Wang [email protected]

li4wang avatar Sep 07 '24 02:09 li4wang

@eolivelli can I get a review for this? Thanks

li4wang avatar Sep 12 '24 01:09 li4wang

Thanks for reviewing it, @anmolnar.

Yes, we would still want exclusive locking between snapshot and restore to make sure data tree is not replaced by restoring operation while snapshotting is in progress.

li4wang avatar Oct 16 '24 02:10 li4wang

added a specific lock just for snapshot and restore operations. @anmolnar would you mind taking a look at it. Thanks.

li4wang avatar Oct 16 '24 02:10 li4wang

@anmolnar I've addressed your comment. Can you take a quick look at it? Thanks.

li4wang avatar Oct 21 '24 17:10 li4wang

Are you sure that we don't need to synchronize between snapshotting and sync()?

anmolnar avatar Dec 09 '24 19:12 anmolnar

@kezhuw Thoughts?

anmolnar avatar Dec 09 '24 19:12 anmolnar

We ran into perf issue when had sync on the ZookeeperServer object. Whenever snapshot was taken, the sync operations were queued up in the outstanding request queue and the sync() operation latency was largely increased. (edited) There was no synchronize between snapshotting and sync() before. The intention of adding the synchronized was to takeSnapshot() and restore() was to sync between the two operations. I didn’t realize the sync() call also requires ZookeeperServer object lock.

When has synchronized been added to take/restore snapshot? Are you saying that sync() was already synchronized which should prevent multiple sync's running at the same time, but accidentally you introduced snapshotting into this lock?

anmolnar avatar Dec 11 '24 20:12 anmolnar

When has synchronized been added to take/restore snapshot?

synchronized on the ZookeeperServer object was added when we introduced remote backup and restore feature, specifically https://github.com/apache/zookeeper/pull/1943

Are you saying that sync() was already synchronized which should prevent multiple sync's running at the same time, but accidentally you introduced snapshotting into this lock?

Yes. I didn't know that sync() is synchronized on FollowerZookeeperServer, which means it's also synchronized onZookeeperServer object. That's why we had lock contention between takingSnapshot() and sync().

li4wang avatar Dec 12 '24 19:12 li4wang

LGTM

it causes lock contention on the ZookeeperServer object with the sync operation.

Does anyone have any cue about why sync get synchronized ? I saw that sync is called only in one thread and pendingSyncs is concurrent safe.

I have no idea either. I think we could remove synchronize from sync() too. Or remove it only from sync().

anmolnar avatar Dec 19 '24 16:12 anmolnar

ZOOKEEPER-136 added synchronized to sync()

anmolnar avatar Dec 19 '24 16:12 anmolnar

Thanks @anmolnar and @kezhuw

I saw that sync is called only in one thread and pendingSyncs is concurrent safe. I think we could remove synchronize from sync() too. Or remove it only from sync().

sync() is synchronized on the ZookeeperServer object, which is accessed by multiple-threads. Through inheritance and composition, the sync() operation is currently synchronized with following API calls.

ZookeeperServer.enqueueRequest(), 
ZookeeperServer.submitRequestNow()
ZookeeperServer.startup()
ZookeeperServer.shutdown()
Learner.syncWithLeader()

sync() is used for strong consistency of not getting stale data in the read after write scenario. Not sure if it's safe to remove synchronization from it.

li4wang avatar Dec 20 '24 22:12 li4wang

@anmolnar @kezhuw The PR is good for merge now, right? Thanks.

li4wang avatar Dec 20 '24 22:12 li4wang