[ZOOKEEPER-3500] Improving the ZAB UPTODATE semantic to only issue it to learner when there is limited lagging
With a large snapshot and high write RPS, when a learner does a SNAP sync with the leader, there are lots of txns that need to be replayed between the NEWLEADER and UPTODATE packets.
Depending on how big the snapshot and the traffic are, our benchmark shows it may take more than 30s to replay all those txns, which means that by the time we process the UPTODATE packet we are still 30s behind; with 10K txns/s that's 300K txns of lag.
And we start serving client traffic right after we receive the UPTODATE packet, which means clients will see a lot of stale data.
The idea here is to check and only send the UPTODATE packet when the learner is lagging behind the leader by a limited number of txns (a rough sketch of the check is below). It doesn't change the ZAB protocol, but it changes when ZK applies the txns between NEWLEADER and UPTODATE.
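To make the idea concrete, here is a minimal leader-side sketch, not the actual LearnerHandler code from the patch: the class, field, and method names (UpToDateGateSketch, maxAllowedLagTxns, lastQueuedZxid, lastAckedZxid) and the threshold value are all hypothetical.

```java
// Minimal sketch (assumptions, not the real ZooKeeper implementation): the leader
// keeps streaming txns to the learner and holds back UPTODATE until the learner's
// replay backlog is bounded.
public class UpToDateGateSketch {
    // Assumed tuning knob for the allowed lag; not a real ZooKeeper property.
    private final long maxAllowedLagTxns = 1000;

    // Highest zxid queued to this learner / highest zxid the learner has acked back.
    // In the real server this state would live in the LearnerHandler.
    private volatile long lastQueuedZxid;
    private volatile long lastAckedZxid;

    /** Delays UPTODATE until the learner has caught up to within the allowed lag. */
    void waitUntilCaughtUpThenSendUpToDate() throws InterruptedException {
        while (lastQueuedZxid - lastAckedZxid > maxAllowedLagTxns) {
            Thread.sleep(100); // the learner keeps draining and acking txns meanwhile
        }
        sendUpToDate(); // only now does the learner start serving clients
    }

    private void sendUpToDate() {
        // Placeholder: the real code would queue the UPTODATE QuorumPacket on the
        // learner's outgoing packet queue.
        System.out.println("UPTODATE sent; remaining lag = " + (lastQueuedZxid - lastAckedZxid));
    }
}
```

The point of the gate is that nothing about the proposal/commit stream changes; only the moment the learner is told it is "up to date" (and hence starts serving reads) moves later.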
We haven't merged this change internally yet; we'd like to hear some feedback here. Please help review and let us know if there is any red flag with doing this.
@hanm @anmolnar @eolivelli do you have time to take a look at this? We need some feedback on the idea here.
sure, this was on my list anyway.
there are lots of txns that need to be replayed between the NEWLEADER and UPTODATE packets.
I think these are the transactions queued on learner while SNAP sync is happening?
And we start serving client traffic right after we receive the UPTODATE packet, which means clients will see a lot of stale data.
Does the stale data cause any issues? I think this is an optimization issue instead of a correctness issue: a client will never end up in a case where it fails to see its last write when connecting to a server whose sync is still in flight, and ZooKeeper's sequential consistency is still guaranteed. If so, I'm wondering why seeing stale data would be an issue if the servers will eventually converge?
it changes when ZK applies the txns between NEWLEADER and UPTODATE.
I think with this change clients will have to expect a slightly longer recovery time in certain cases - were there any concerns / discussions around this increased recovery time on the client side and its implications?
I am thinking about this idea. I don't have much time these days (vacation). Thanks for pinging me. I will give my feedback as soon as possible.
@hanm to the previous comment:
I think these are the transactions queued on learner while SNAP sync is happening?
Those are the txns queued up on the leader while the learner was taking a snapshot after receiving the NEWLEADER packet.
Does the stale data cause any issues? I think this is an optimization issue instead of a correctness issue: a client will never end up in a case where it fails to see its last write when connecting to a server whose sync is still in flight, and ZooKeeper's sequential consistency is still guaranteed. If so, I'm wondering why seeing stale data would be an issue if the servers will eventually converge?
It's not a correctness issue, it's just a question of how far behind the server is when it starts serving client traffic. If the client has a last-seen zxid then it's fine, it will try to connect to a different server, but new clients who don't have that information may read stale data.
Another improvement with this change is that we can decouple the sync timeout from the data size and traffic; otherwise it may take the learner longer than the sync timeout to replay those txns, so it can never finish the SYNC (some rough numbers are sketched below).
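To put rough numbers on why the sync timeout is a concern, here is a back-of-the-envelope sketch using the figures quoted earlier in the thread; the tickTime and limit values are assumptions taken from the sample zoo.cfg (tickTime=2000, syncLimit=5), not measurements from this patch, and whichever of initLimit/syncLimit bounds this phase applies in practice.

```java
// Rough arithmetic only: backlog size from the benchmark above vs. an assumed
// sync timeout built from sample-config values.
public class SyncTimeoutBackOfEnvelope {
    public static void main(String[] args) {
        long writeRatePerSec = 10_000;  // ~10K txns/s write traffic (from the benchmark above)
        long replaySeconds   = 30;      // ~30s spent replaying after loading the snapshot
        long backlogTxns     = writeRatePerSec * replaySeconds;  // ~300K txns of lag

        long tickTimeMs      = 2_000;   // assumption: sample zoo.cfg value
        long syncLimitTicks  = 5;       // assumption: sample zoo.cfg value
        long syncTimeoutMs   = syncLimitTicks * tickTimeMs;      // 10,000 ms

        System.out.println(backlogTxns + " txns to replay vs. a " + syncTimeoutMs
                + " ms sync timeout; replay time grows with snapshot size and traffic");
    }
}
```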
I think with this change clients will have to expect a slightly longer recovery time in certain cases - were there any concerns / discussions around this increased recovery time on the client side and its implications?
In general this won't affect quorum uptime, but it may take a longer time for a minority server that is lagging too far behind.
Those are the txns queued up on the leader while the learner was taking a snapshot after receiving the NEWLEADER packet.
Are these txns queued in queuedPackets of LearnerHandler? I was thinking you referred to the packetsNotCommitted of Learner.
In general this won't affect quorum uptime, but it may take a longer time for a minority server that is lagging too far behind.
One case I was thinking of is when a large subset of servers is lagging and recovering: the ZooKeeper clients might have to wait longer to find a server to establish a connection with, so the client side's retry / timeout might have to be adjusted to deal with this change.
+1 on the proposal.
@eolivelli it seems tricky to test 3.5 compatibility with 3.6.
@hanm here are the answers to your questions:
Are these txns queued in queuedPackets of LearnerHandler? I was thinking you referred to the packetsNotCommitted of Learner.
Yes, those txns are queued up in the LearnerHandler: the learner was busy taking the snapshot and wasn't reading from the socket, which put pressure back on the leader and caused packets to pile up in the LearnerHandler's queue (a toy model of this backpressure is sketched below).
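A toy model of that backpressure, under the assumption that the learner simply does not consume anything while deserializing the snapshot; the unbounded queue here just stands in for the LearnerHandler's queuedPackets, it is not the real class.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Toy model only: while the "learner" is blocked taking a snapshot and not
// consuming txns, the "leader" keeps producing them, so the per-learner queue grows.
public class QueueBackpressureSketch {
    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<Long> queuedPackets = new LinkedBlockingQueue<>();

        Thread learner = new Thread(() -> {
            try {
                Thread.sleep(3_000);           // pretend the snapshot load takes 3s
                while (true) {
                    queuedPackets.take();      // only afterwards does it start draining txns
                }
            } catch (InterruptedException ignored) {
            }
        });
        learner.start();

        for (long zxid = 1; zxid <= 30_000; zxid++) {
            queuedPackets.put(zxid);           // leader keeps proposing while the learner is busy
            if (zxid % 10_000 == 0) {
                System.out.println("leader-side backlog: " + queuedPackets.size() + " packets");
            }
        }
        learner.interrupt();
    }
}
```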
One case I was thinking of is when a large subset of servers is lagging and recovering: the ZooKeeper clients might have to wait longer to find a server to establish a connection with, so the client side's retry / timeout might have to be adjusted to deal with this change.
It may need to retry a few servers before finding one that is up and serving, but that should be fine, since only a minority of servers will be in this state, and they will disconnect clients if they haven't finished syncing with the leader yet.
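For reference, that retry happens inside the standard client: the application only supplies the full connect string and the client library moves on to another server if one refuses or drops the connection while it is still syncing. A small sketch with placeholder hostnames and timeout:

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooKeeper;

// Sketch: with a multi-server connect string, server selection and reconnection
// are handled inside the ZooKeeper client; the hostnames and timeout here are
// placeholders, not values from this thread.
public class ClientRetrySketch {
    public static void main(String[] args) throws Exception {
        String connectString = "zk1:2181,zk2:2181,zk3:2181"; // hypothetical ensemble
        int sessionTimeoutMs = 30_000;

        ZooKeeper zk = new ZooKeeper(connectString, sessionTimeoutMs,
                (WatchedEvent event) -> System.out.println("client event: " + event.getState()));

        // ... use zk as usual; if the chosen server disconnects us because it is
        // still syncing, the client tries the next host in the connect string.
        zk.close();
    }
}
```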