
ZOOKEEPER-3615: Provide formal specification and verification using TLA+ for Zab

Open BinyuHuang-nju opened this issue 3 years ago • 37 comments

We have written a formal specification of Zab as described in the paper "High-performance broadcast for primary-backup systems", and have been running a certain scale of model checking to verify the correctness of Zab (still running at the moment). There may be differences between our specification and Zab as described in the paper or as implemented. Please let us know about any flaws in the specification if you have questions.

BinyuHuang-nju avatar May 03 '21 06:05 BinyuHuang-nju

@Vanlightly @fpj you are probably interested in this PR

eolivelli avatar May 03 '21 06:05 eolivelli

Yes, I will review it and give feedback.

Vanlightly avatar May 03 '21 09:05 Vanlightly

@BinyuHuang-nju I have run it a couple of times and hit some invariant violations, such as LocalPrimaryOrder and Integrity. I have not taken the time to understand the spec or these invariants yet but would you be able to specify which invariants you have enabled when you run it? I am using Servers {s1,s2} and Values {v1} as sets of model values, with the rest of the constants as model values.

Vanlightly avatar May 03 '21 09:05 Vanlightly

@Vanlightly Sorry. OK, I have been checking the invariants Leadership, PrefixConsistency, Integrity and Agreement, and they have held for several days of running. In principle, all invariants should remain unviolated as long as we can guarantee that Leadership and PrefixConsistency hold at all times, so I just focused on Leadership and PrefixConsistency. The invariants that were violated are both related to the variable proposalMsgsLog, so I think I made mistakes either in the parts of the spec related to proposalMsgsLog or in writing PrefixConsistency. Please let me check the spec and reply to you later; in the meantime, you could check whether Leadership and PrefixConsistency are also violated. Thank you very much for your feedback!

BinyuHuang-nju avatar May 03 '21 12:05 BinyuHuang-nju

Thanks for pushing the limits of my memory, hopefully I'll be able to help out here.

Local primary order is about the delivery order of messages. I had a quick look at the spec and it expects a message cardinality of at least 2 (Cardinality(mentries(i, e)) ≥ 2), so it should correctly evaluate to false in the absence of multiple values, no?

As for Integrity, it is about not creating values out of thin air, so it should be straightforward to satisfy. I had a quick look, but nothing has stood out to me so far.
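
As a side note, the informal statement of Integrity could be rendered in TLA+ roughly as follows; this is only a sketch, DeliveredTxns and WasBroadcast are made-up helpers, not operators from the spec:

\* Sketch only: every transaction delivered by a server must have been
\* broadcast by some primary, i.e. recorded in proposalMsgsLog.
Integrity ==
    \A i \in Server :
        \A tx \in DeliveredTxns(i) :
            \E p \in proposalMsgsLog : WasBroadcast(p, tx)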

fpj avatar May 03 '21 14:05 fpj

@Vanlightly @fpj I am very glad to see your replies!!

I have found the faults in my spec that resulted in the invariant violations of LocalPrimaryOrder and Integrity. In writing LocalPrimaryOrder, I mistakenly wrote 'A => B' as 'A /\ B'. The invariant Integrity means that if some follower delivers a transaction, then some primary must have broadcast it; but I only considered broadcasts via messages of type PROPOSE from phase 3 (the normal case), and ignored broadcasts via messages of type NEWLEADER from phase 2 (Sync).
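
To make the first fix concrete, the general shape of the change is something like the following (a sketch with made-up operator names, not the actual definitions in the spec):

\* Sketch only. An ordering property should be an implication; written as a
\* conjunction it is violated in every state where the premise is false,
\* e.g. whenever fewer than two transactions have been delivered.
BuggyLocalPrimaryOrder(i)   == DeliveredTwo(i) /\ DeliveredInProposalOrder(i)
CorrectLocalPrimaryOrder(i) == DeliveredTwo(i) => DeliveredInProposalOrder(i)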

Now I have fixed these bugs and added a commit. You can download the new spec and run the models. So far, using Servers {s1,s2} and Values {v1,v2} (it is not important to change CONSTANT Value) as sets of model values, and checking all the invariants I defined, I have not found any errors.

I will run this model and update the data tomorrow. Since it is half past ten here in China, I have to go home, and I will check your replies tomorrow.

Thanks again for your attention!! (By the way, since this is a project independent of zookeeper, and I am a rookie at using Travis CI, how can I make the Travis CI build pass?)

BinyuHuang-nju avatar May 03 '21 14:05 BinyuHuang-nju

Thanks for the pull request @BinyuHuang-nju , glad someone finally picked this up. I'll provide some feedback this week.

Quick notes:

how can I make the Travis CI build pass

You would need to put the Apache License header on the TLA+ specification files: Zab.tla and ZabWithQTest.tla. Hopefully this will make the build happy. Here is an example of what the license looks like: https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/ZooKeeper.java#L1
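
For example, the top of each .tla file could carry the standard ASF header in a TLA+ comment block, something like the sketch below (whether this exact placement and comment style satisfies the license check is an assumption):

(* Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to you under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License. *)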

Also, can you please update the title of the pull request with a JIRA number so this pull request can get merged after it's reviewed? I think ZOOKEEPER-3615 is a good candidate, and I assume @maoling doesn't mind you taking over this JIRA.

hanm avatar May 03 '21 18:05 hanm

@hanm Thank you. I will update the title and try that approach to the build later.

BinyuHuang-nju avatar May 04 '21 06:05 BinyuHuang-nju

I left the new version running on my workstation overnight with Servers {s1,s2} (symmetrical) and Values {v1,v2} (symmetrical) and no violations occurred. It was still running this morning after 16 hours and I had to stop it. It reached diameter 66, 2,370,569,088 distinct states, 127GB of storage and, interestingly, four actions with 0 states: FindCluster, LeaderHandleCEPOCHinPhase3, LeaderHandleACKLDinPhase3 and BecomeFollower.

Again, I haven't had time to dig into the spec itself. I will take some time soon to understand each action and invariant. The numbers above do exceed what you have reported though and the four actions without states could indicate a problem.

Vanlightly avatar May 04 '21 07:05 Vanlightly

@Vanlightly It is correct that these actions are not executed. When there are only two servers, the quorum is exactly those two servers, and these actions can only be executed when the number of servers is larger than 2. Take FindCluster: this action means a server wants to rejoin the cluster after a restart, but it will never receive responses from a quorum, because only one other server is left and that alone cannot form a quorum.

Thank you for your attention! You will see these actions reached when you set the number of servers to 3, 4 or 5.

BinyuHuang-nju avatar May 04 '21 07:05 BinyuHuang-nju

@BinyuHuang-nju with 3 servers and 1 value it now hits all actions. It's been running for 20 hours and reached a diameter of 22 with over 17 trillion states, 2 trillion unique states and the queue size is over 1 trillion and growing. When I ran it with 2 servers it had reached a diameter of over 60 when I killed it. I see there are some constraints regarding epochs in the enabling conditions of the actions to keep it to only two epochs but even so we've got a very large state space.

In your results you describe much smaller state counts. Have you run this to completion or only as far as the state counts you have described? If you have run it to completion, what was your model?

Vanlightly avatar May 05 '21 07:05 Vanlightly

@Vanlightly Sorry, I didn't make it clear that I did not run the model to completion. Actually, it is almost impossible to traverse all states due to state-space explosion, which is a limitation of model checking, so it is normal that the model does not stop.

We can consider situations where a server halts and restarts multiple times, or where a leader is re-elected multiple times without increasing the epoch because the last round of elections ended before the prospective leader broadcast NEWEPOCH. So in principle the state space is infinite when running models. On top of that, many repetitive actions such as restart and re-election happen continuously, which makes the state space expand rapidly.

So in practice, I run it for a long time to judge whether there is an invariant violation or not.

Also, I have run models that exclude the restart situation (by modifying "currentEpoch[i] < 0" in Restart and RecoveryAfterRestart), which produces more of the actions we care about. So far the results look good.

If you have any requirements regarding the scale of the model, we will do our best to accommodate them within our capacity. Thank you for your investment in this verification!

BinyuHuang-nju avatar May 05 '21 08:05 BinyuHuang-nju

@Vanlightly Hi! Since I previously thought that an infinite state space was normal, I didn't pay much attention to how to constrain the state space. Thanks for the reminder; I think I can modify the spec ZabWithQTest.tla to try to let the models run to completion.

I added global variables electionNum and totalRestartNum to constrain the state space and try to run models to completion: electionNum < MaxElectionNum in Election, totalRestartNum < MaxTotalRestartNum in Restart and RecoveryAfterRestart, and Len(history[i]) < MaxTransactionNum in ClientRequest. When I set 2 servers, MaxElectionNum = 2, MaxTotalRestartNum = 3 and MaxTransactionNum = 2, the model ran to completion in 1 minute and reached a diameter of 47 with 2,518,479 states. In this case it is normal that actions involving election and restart, and some low-probability actions such as BecomeFollower, are not traversed.
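
The shape of the change is roughly the following (a sketch; ElectionBody and RestartBody are placeholders for the original action bodies, not operators in ZabWithQTest.tla):

\* Sketch only: bound repeatable actions with counters that only grow.
Election(i) ==
    /\ electionNum < MaxElectionNum            \* new guard
    /\ ElectionBody(i)                         \* stands in for the original action body
    /\ electionNum' = electionNum + 1

Restart(i) ==
    /\ totalRestartNum < MaxTotalRestartNum    \* new guard
    /\ RestartBody(i)                          \* stands in for the original action body
    /\ totalRestartNum' = totalRestartNum + 1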

So we may now be able to run the model to completion when the scale is not very large. I will try this approach with bigger settings such as 3 servers, and tell you the results tomorrow if it runs to completion. I will push the new spec here immediately.

That said, in principle the state space of this spec is infinite, so bounding it this way makes the verification more like testing than exhaustive model checking. I think it is a trade-off between verifying Zab's correctness and keeping model checking tractable.

BinyuHuang-nju avatar May 05 '21 14:05 BinyuHuang-nju

@Vanlightly Hi! When I set 3 servers with parameters MaxTotalRestartNum = 1, MaxElectionNum = 1, MaxTransactionNum = 2, the model had not finished after 31 hours and I had to stop it. Even with 3 servers, the speed of state-space growth is enormous.

But it reached a diameter of 45 with 9,602,018,536 states. Compared to the previous models, which reached a diameter of about 21 in 15 hours with 3 servers, the model can now reach a diameter of 43 in 15 hours. Obviously the state space has been greatly compressed, but it is still difficult to reach termination.

However, most states used to come from the timeout, recovery and phase 1 (Discovery) actions, whereas now actions from phase 3 (Broadcast) account for a considerable share. This is what we want to see, because states reached through phase 3 actions are more likely to go wrong, so the current spec may have better error-detection capability than the previous one.

BinyuHuang-nju avatar May 07 '21 11:05 BinyuHuang-nju

@BinyuHuang-nju good work getting it to reach a higher diameter, that improves our confidence in the spec. I will start digging into the functioning of the spec itself and provide feedback. I am interested in seeing if I can cause violations to occur by making subtle changes to the spec (making sure it detects problems that I introduce), I'm also interested in profiling it to see if any Java overrides could improve performance.

Vanlightly avatar May 10 '21 07:05 Vanlightly

@Vanlightly Thank you! There are differences between Zab in the paper and Zab in my spec; please forgive me if that causes confusion when reading the spec. We are thinking about other ways to compress the state space more significantly without affecting correctness, but it will take longer before we can achieve a good result. We would be very happy if you could provide comments or suggestions on the spec and the model checking!!

BinyuHuang-nju avatar May 10 '21 12:05 BinyuHuang-nju

@hanm Maybe it is inappropriate for me to ask the following questions under this PR. I am now learning ZAB 1.0 and preparing to write a spec for it. Could you please let me ask about some details?

Currently I have several questions about ZAB 1.0.

  1. For some detailed industrial implementation aspects (such as the PING message type and the connection between the leader and followers), do I need to model them in the spec?
  2. I see that before sending NEWLEADER, the leader chooses the best message among SNAP, TRUNC and DIFF according to the corresponding follower's state. Do I need to model these, or can I abstract this part away and just send NEWLEADER to sync with followers?
  3. I don't understand whether the variable 'loracle' is a global variable that holds the latest leader ID and that every server can read, or a per-server variable like 'votedFor' in Raft. (I actually used the latter in Zab.tla, where it corresponds to 'leaderOracle', and I think correctness is unaffected either way.)
  4. The last question I want to ask is about recovery, because I didn't see notes on this part in the paper or the linked document. I saw how the leader handles a new server that wants to join, but I didn't see how a restarted server finds the latest leader. If 'loracle' is the global variable from question 3, I can understand this part. In Zab.tla, I use the method where the follower broadcasts messages asking other servers what their local oracle is, and updates its own oracle when it receives the same oracle and epoch from a quorum (I used this method because servers recover this way in Viewstamped Replication). A follower must have received the corresponding PROPOSAL before receiving a COMMIT, if the follower is one of the initial servers that joined Q. But I think this may not always hold when the follower joins Q midway. So I want to know how a follower catches up when it receives a COMMIT for a transaction that does not exist in its local history. (You can see this condition in the image at the bottom of the README. When I wrote the spec for Zab pre-1.0, I chose to let followers keep re-sending CEPOCH to obtain the latest transactions until the conflict described above disappears.)

I am very sorry if my questions bother you. Thank you for the various feedback and suggestions you have provided before!

BinyuHuang-nju avatar May 19 '21 14:05 BinyuHuang-nju

From my perspective:

  1. Implement the minimum required, in order to keep things simpler and reduce the state space. If the only point of the PING message is to act as a keep-alive when there is no data to replicate, then don't include it. Do not model timeouts explicitly; instead just allow a node to go back to an election at any time.
  2. SNAP and DIFF exist to optimize performance for the given scenario; for the TLA+ we don't need to worry about that. I would implement DIFF or SNAP but not both. SNAP would result in a smaller state space as it is only a single message. TRUNC, as far as I know, is for when a follower is ahead of the new leader (with uncommitted entries), so it would be required.

Regarding 3 and 4, unfortunately the Zab 1.0 document is lacking in details, so someone with deep knowledge of the protocol will need to explain the missing pieces. If I have time I might read the code to reverse-engineer it back into a detailed protocol description.

Vanlightly avatar May 19 '21 18:05 Vanlightly

@hanm Maybe it is inappropriate for me to ask the following questions under this PR. I am now learning ZAB 1.0 and preparing to write a spec for it. Could you please let me ask about some details?

Currently I have several questions about ZAB 1.0.

  1. For some detailed industrial implementation aspects (such as the PING message type and the connection between the leader and followers), do I need to model them in the spec?

It depends on whether the particular message type is critical to revealing important details of the protocol. Messages like NEWLEADER or UPTODATE are an integral part of the protocol; I don't think we can abstract them away. For PING, though, I think we can omit it for now.

  2. I see that before sending NEWLEADER, the leader chooses the best message among SNAP, TRUNC and DIFF according to the corresponding follower's state. Do I need to model these, or can I abstract this part away and just send NEWLEADER to sync with followers?

I am inclined to abstract this part away with a "RECOVERY_SYNC" operation. In fact, that's what the original pre-1.0 protocol did, where the leader always syncs learners with its full history. We can go more fine-grained later. This will hopefully reduce the state space.

  3. I don't understand whether the variable 'loracle' is a global variable that holds the latest leader ID and that every server can read, or a per-server variable like 'votedFor' in Raft. (I actually used the latter in Zab.tla, where it corresponds to 'leaderOracle', and I think correctness is unaffected either way.)

ZAB 1.0 and its implementation do not use this 'loracle'. Instead, a server always starts a leader election and gets the latest leader ID through the leader election results. If we want an abstraction here, I would tend to think of 'loracle' as a global variable.

  4. The last question I want to ask is about recovery, because I didn't see notes on this part in the paper or the linked document. I saw how the leader handles a new server that wants to join, but I didn't see how a restarted server finds the latest leader. If 'loracle' is the global variable from question 3, I can understand this part. In Zab.tla, I use the method where the follower broadcasts messages asking other servers what their local oracle is, and updates its own oracle when it receives the same oracle and epoch from a quorum (I used this method because servers recover this way in Viewstamped Replication).

As mentioned earlier, a restarted (or newly joined) server finds the latest leader through leader election.

A follower must have received the corresponding PROPOSAL before receiving a COMMIT, if the follower is one of the initial servers that joined Q. But I think this may not always hold when the follower joins Q midway. So I want to know how a follower catches up when it receives a COMMIT for a transaction that does not exist in its local history. (You can see this condition in the image at the bottom of the README. When I wrote the spec for Zab pre-1.0, I chose to let followers keep re-sending CEPOCH to obtain the latest transactions until the conflict described above disappears.)

The "follower catches up" part of the protocol is the recovery part (SNAP, DIFF, TRUNC). A follower will not start broadcasting until the recovery phase is finished. The invariant here is that once recovery has finished, the follower has the latest history of the quorum / leader, so any COMMIT it receives has a corresponding entry in its history. This is different from Paxos or Raft, where there is no dedicated recovery phase.
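
As an aside, one possible TLA+ rendering of that informal invariant, using variable names from the later version of the spec in this thread (a sketch, not necessarily an invariant the spec defines):

\* Sketch: once a follower has finished recovery and is broadcasting,
\* everything it has been told to commit already exists in its local history.
CommittedEntriesExistLocally ==
    \A i \in Server :
        zabState[i] = BROADCAST => commitIndex[i] <= Len(history[i])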

I am very sorry if my questions bother you. Thank you for the various feedback and suggestions you have provided before!

No worries. I am not super active in the community these days, but I will answer questions on the issues that I am involved in (with an undefined SLA).

hanm avatar May 24 '21 16:05 hanm

@Vanlightly @hanm Thank you for your replies! Your answers to these questions will simplify my spec of ZAB 1.0.

BinyuHuang-nju avatar May 25 '21 12:05 BinyuHuang-nju

@Vanlightly @hanm Hi! I have read some of the Zab code, but I still cannot understand some details, so I would like to ask you about them.

  1. At line 630 in 'syncWithLeader' in Learner.java, I see that a follower in phase SYNCHRONIZATION may receive PROPOSAL and COMMIT. But in the paper, the leader adds a new server to its 'forwardingFollowers' (Q in the paper) only after receiving the ACK of NEWLEADER from it, so in the paper this new server should not receive PROPOSAL or COMMIT, since in phase SYNCHRONIZATION it has not yet sent ACK-NEWLEADER. My question is: in ZAB 1.0, when does the leader add a newly joined server to its 'forwardingFollowers'? (Sorry, I could not find this in the code.)

  2. My understanding of the question above is: in phase 2, the leader syncs with each follower and sends NEWLEADER only to the follower that has completed sync, rather than broadcasting NEWLEADER. So when the leader is in phase 3, there may still be a follower in phase 2 that has not completed sync with the leader (since it has not received NEWLEADER), yet it is allowed to receive PROPOSAL and COMMIT. I am not sure whether this understanding is right. If so, my spec of phase 2 would be, for each connected follower: first, the leader syncs with the follower through an abstract action RECOVERY_SYNC; second, in that action the follower obtains the leader's latest history and the leader sends NEWLEADER to it; then the leader and the follower exchange ACK-NEWLEADER and UPTODATE, and the follower moves to phase 3. The leader moves to phase 3 upon receiving ACK-NEWLEADER from a quorum.

  3. At line 766 in 'syncWithLeader' in Learner.java, I see that when receiving NEWLEADER, the variable 'packetsNotCommitted' is emptied, and when receiving UPTODATE there is no handling of the remaining transactions to commit them. Does that mean the follower considers all transactions committed upon receiving NEWLEADER rather than UPTODATE? It also seems that NEWLEADER, ACK-LEADER and UPTODATE carry little data except the epoch, and are used only for acknowledgement between leader and followers.

  4. In phase 1 of the manual we mentioned above, there is a note: "Note, if a follower is connecting, but the leader is already established (in phase 3) the follower follows the phases, but the leader ignores any ACKs." It seems that a leader in phase 3 (BROADCAST) will ignore ACKEPOCH from phase 1 and the ACK of NEWLEADER from phase 2, so how does this newly connecting follower receive UPTODATE and move to phase BROADCAST?

Since these details affect how a follower syncs with the leader in my spec of ZAB 1.0, I want to understand them better. Thank you!

BinyuHuang-nju avatar May 29 '21 14:05 BinyuHuang-nju

  1. At line 630 in 'syncWithLeader' in Learner.java, I see that a follower in phase SYNCHRONIZATION may receive PROPOSAL and COMMIT. But in the paper, the leader adds a new server to its 'forwardingFollowers' (Q in the paper) only after receiving the ACK of NEWLEADER from it, so in the paper this new server should not receive PROPOSAL or COMMIT, since in phase SYNCHRONIZATION it has not yet sent ACK-NEWLEADER. My question is: in ZAB 1.0, when does the leader add a newly joined server to its 'forwardingFollowers'? (Sorry, I could not find this in the code.)

You can't find it, because there is no such thing in production. In production, a newly joined server connects to the leader first, and the leader spins up a dedicated handler thread to sync that follower. You can check the LearnerHandler implementation for details.

The other important detail here is that in the paper, NEWLEADER does not have the same semantics as in the actual production implementation. In the paper, NEWLEADER(e, Ie) carries the epoch AND the leader's history, but in the actual implementation NEWLEADER is a pure control operation rather than the mixed control + data operation described in the paper.

  2. My understanding of the question above is: in phase 2, the leader syncs with each follower and sends NEWLEADER only to the follower that has completed sync, rather than broadcasting NEWLEADER. So when the leader is in phase 3, there may still be a follower in phase 2 that has not completed sync with the leader (since it has not received NEWLEADER), yet it is allowed to receive PROPOSAL and COMMIT.

As previously mentioned, because each follower's sync is handled by a dedicated handler thread, there is no such state as "the leader is in phase X". We can say that "the handler of follower X is in phase Y", though. Because of this, we don't need to worry about "when the leader is in phase 3, there may exist some follower in phase 2": that will not happen, because it is the handler thread, not the leader, that moves through the phases, and since there is a one-to-one mapping between handler threads and followers, their phases always match.

I am not sure whether this understanding is right. If so, my spec of phase 2 would be, for each connected follower: first, the leader syncs with the follower through an abstract action RECOVERY_SYNC; second, in that action the follower obtains the leader's latest history and the leader sends NEWLEADER to it; then the leader and the follower exchange ACK-NEWLEADER and UPTODATE, and the follower moves to phase 3. The leader moves to phase 3 upon receiving ACK-NEWLEADER from a quorum.

That sequence of events is correct.

  3. At line 766 in 'syncWithLeader' in Learner.java, I see that when receiving NEWLEADER, the variable 'packetsNotCommitted' is emptied, and when receiving UPTODATE there is no handling of the remaining transactions to commit them.

When a follower receives NEWLEADER, it applies all previously queued proposals from the leader atomically to its state machine. UPTODATE is merely a control operation signalling to the follower that leadership and the quorum are established and that this specific follower can start serving traffic.

Does that mean the follower considers all transactions committed upon receiving NEWLEADER rather than UPTODATE? It also seems that NEWLEADER, ACK-LEADER and UPTODATE carry little data except the epoch, and are used only for acknowledgement between leader and followers.

I don't quite get your first question here. The remaining part is correct: these are control operations that mark the milestones of the sync phase, and they don't carry replicated data.

  4. In phase 1 of the manual we mentioned above, there is a note: "Note, if a follower is connecting, but the leader is already established (in phase 3) the follower follows the phases, but the leader ignores any ACKs." It seems that a leader in phase 3 (BROADCAST) will ignore ACKEPOCH from phase 1 and the ACK of NEWLEADER from phase 2, so how does this newly connecting follower receive UPTODATE and move to phase BROADCAST?

As previously mentioned, it is a handler thread in the leader that handles the sync phase. Each handler thread has its own state, and that state is kept in sync with the follower the handler thread is dedicated to.

Since these details affect how a follower syncs with the leader in my spec of ZAB 1.0, I want to understand them better. Thank you!

I would recommend reading https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab1.0, which matches the production implementation exactly. Maybe treat that as the source of truth and use the paper as advisory only.

hanm avatar Jun 02 '21 03:06 hanm

@hanm OK, thank you very much! Your replies above will help me a lot. I am currently focused on writing the spec for ZAB 1.0, abstracting two modules: election, and the sync phase before sending NEWLEADER. I think I can implement the election module if necessary, and I would need to learn more about the sync phase if we decide to model that part too.

BinyuHuang-nju avatar Jun 03 '21 14:06 BinyuHuang-nju

@hanm There's one point I don't get.

When a follower receives NEWLEADER, it applies all previously queued proposals from the leader atomically to its state machine.

It seems that when a learnerMaster sends NEWLEADER to its corresponding follower, the leader may not yet have completed sync with a quorum of followers. So a transaction saved by this follower may not have been saved by a quorum, which means it is not committed and, from my perspective, cannot be applied. I don't understand why these queued proposals can be applied. Thank you!

BinyuHuang-nju avatar Jun 07 '21 06:06 BinyuHuang-nju

@hanm There's one point I don't get.

When a follower receives NEWLEADER, it applies all previously queued proposals from the leader atomically to its state machine.

It seems that when a learnerMaster sends NEWLEADER to its corresponding follower, the leader may not yet have completed sync with a quorum of followers. So a transaction saved by this follower may not have been saved by a quorum, which means it is not committed and, from my perspective, cannot be applied. I don't understand why these queued proposals can be applied. Thank you!

You are correct that NEWLEADER does not imply the transactions synced from the leader to a follower are quorum-committed. That's fine, because the follower is not yet serving traffic (or visible to clients), so sequential consistency is not violated. Later, when the leader gets ACKs of the NEWLEADER from a quorum of followers, it sends UPTODATE to the followers, and after this the quorum and leadership are established. Followers can serve traffic after receiving UPTODATE.

hanm avatar Jun 11 '21 04:06 hanm

@hanm Thank you! I am writing this spec now and will push it here. At that point, I hope you can help me check whether my spec deviates logically from ZAB 1.0 in production, especially in the SYNC part. I will also implement the parts we chose to abstract, such as FLE, if you think it is necessary!

BinyuHuang-nju avatar Jun 12 '21 14:06 BinyuHuang-nju

@hanm @Vanlightly Hi! I have now pushed the spec of ZAB 1.0 here. In the spec, I use the action RECOVERYSYNC to abstract the SYNC phase before the leader sends NEWLEADER. Because of the differences between a formal specification and the engineering implementation, and to reduce the state space, the spec differs from the code in some places, but I believe this does not affect the equivalence between the two. For example, in the spec only a leader in the broadcast phase accepts requests, which avoids the extra state space that would come from followers accepting client requests. I will be very glad to see your feedback, and I will reply as soon as possible if you have any questions!

BinyuHuang-nju avatar Jul 15 '21 08:07 BinyuHuang-nju

I have started looking at the new version. One thing I have done is add a TypeOK invariant checking that the types of all variables are correct in all states. This helps me understand the spec better, as it acts as a reference for what the different variables hold, and it can also be useful for catching type errors. I have had it running in simulation mode for a few hours so far with no violations.

Here's the TypeOK:

ServersIncNullPoint == Server \union {NullPoint} 

Zxid ==
    Seq(Nat)
    
HistoryItem ==
     [epoch: Nat,
      counter: Nat,
      value: Value]    
    
Proposal ==
    [msource: Server, mtype: {"RECOVERYSYNC"}, mepoch: Nat, mproposals: Seq(HistoryItem)] \union
    [msource: Server, mtype: {PROPOSAL}, mepoch: Nat, mproposal: HistoryItem]   

Message ==
    [mtype: {FOLLOWERINFO}, mepoch: Nat] \union
    [mtype: {NEWLEADER}, mepoch: Nat, mlastZxid: Zxid] \union
    [mtype: {ACKLD}, mepoch: Nat] \union
    [mtype: {LEADERINFO}, mepoch: Nat] \union
    [mtype: {ACKEPOCH}, mepoch: Nat, mlastEpoch: Nat, mlastZxid: Zxid] \union
    [mtype: {UPTODATE}, mepoch: Nat, mcommit: Nat] \union
    [mtype: {PROPOSAL}, mepoch: Nat, mproposal: HistoryItem] \union
    [mtype: {ACK}, mepoch: Nat, mzxid: Zxid] \union
    [mtype: {COMMIT}, mepoch: Nat, mzxid: Zxid]
    
ElectionState == {LOOKING, FOLLOWING, LEADING}

Vote ==
    [proposedLeader: ServersIncNullPoint,
     proposedZxid: Zxid,
     proposedEpoch: Nat]
     
ElectionVote ==
    [vote: Vote, round: Nat, state: ElectionState, version: Nat]

ElectionMsg ==
    [mtype: {NOTIFICATION}, msource: Server, mstate: ElectionState, mround: Nat, mvote: Vote] \union
    [mtype: {NONE}]
            
TypeOK ==
    /\ zabState \in [Server -> {ELECTION, DISCOVERY, SYNCHRONIZATION, BROADCAST}]
    /\ acceptedEpoch \in [Server -> Nat]
    /\ history \in [Server -> Seq(HistoryItem)] 
    /\ commitIndex \in [Server -> Nat]
    /\ learners \in [Server -> SUBSET ServersIncNullPoint]
    /\ cepochRecv \in [Server -> SUBSET ServersIncNullPoint]
    /\ ackeRecv \in [Server -> SUBSET ServersIncNullPoint]
    /\ ackldRecv \in [Server -> SUBSET ServersIncNullPoint]
    /\ ackIndex \in [Server -> [Server -> Nat]]
    /\ currentCounter \in [Server -> Nat]
    /\ sendCounter \in [Server -> Nat]
    /\ committedIndex \in [Server -> Nat]
    /\ committedCounter \in [Server -> [Server -> Nat]]
    /\ forwarding \in [Server -> SUBSET ServersIncNullPoint]
    /\ initialHistory \in [Server -> Seq(HistoryItem)] 
    /\ tempMaxEpoch \in [Server -> Nat]
    /\ cepochSent \in [Server -> BOOLEAN]
    /\ leaderAddr \in [Server -> ServersIncNullPoint]
    /\ synced \in [Server -> BOOLEAN]
    /\ proposalMsgsLog \in SUBSET Proposal
    /\ epochLeader \in [1..MAXEPOCH -> SUBSET ServersIncNullPoint]
    /\ inherentViolated \in BOOLEAN
    /\ forwarding \in [Server -> SUBSET Server]
    /\ msgs \in [Server -> [Server -> Seq(Message)]]
    
    \* Fast Leader Election
    /\ electionMsgs \in [Server -> [Server -> Seq(ElectionMsg)]]
    /\ recvQueue \in [Server -> Seq(ElectionMsg)]
    /\ leadingVoteSet \in [Server -> SUBSET Server]
    /\ receiveVotes \in [Server -> [Server -> ElectionVote]]
    /\ currentVote \in [Server -> Vote]
    /\ outOfElection \in [Server -> [Server -> ElectionVote]]
    /\ lastZxid \in [Server -> Zxid]
    /\ state \in [Server -> ElectionState]
    /\ waitNotmsg \in [Server -> BOOLEAN]
    /\ currentEpoch \in [Server -> Nat]
    /\ logicalClock \in [Server -> Nat]

I still need to dig into the logic further so I understand it and check it against ZK code.

Vanlightly avatar Aug 31 '21 15:08 Vanlightly

@Vanlightly It's very kind of you to give me these suggestions. I will incorporate them into the spec and update it later! Thank you! If you find any issue, please tell me anytime.

BinyuHuang-nju avatar Sep 04 '21 05:09 BinyuHuang-nju

I have finished my review of the spec against the code and I did not find any problems in terms of correctness. I do, however, have a number of suggestions.

  • Some action names are not very helpful; see my comments in the code.
  • The FastLeaderElection module is too focused on the production implementation, in my opinion. One of the benefits of TLA+ is being free of programming considerations such as multi-threading. The fact that the FastLeaderElection implementation uses the internal queues recvqueue and sendqueue is an implementation detail that we don't need to replicate in the TLA+. I think you could combine ReceiveNotmsg with HandleNotmsg. Also, WaitNewNotmsg and WaitNewNotmsgEnd are not optimized for state space and are perhaps also implementation-oriented. I think there are some big opportunities to make FLE simpler, with a smaller state space.
  • Consider adding the test variables of ZabWithFLETest to the main spec, using a view to remove those variables from state-space exploration and using state constraints to limit exploration via those variables. Then this test spec is not required (a rough sketch of this idea follows below).
  • Add some additional comments to explain what is going on in each action.

The above are not blockers, just suggestions.
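
On the third suggestion, the idea might look roughly like this (a sketch; the operator names and the exact variable tuple are assumptions on my part, not contents of the spec):

\* Sketch: keep the test counters in the main spec, hide them from the state
\* fingerprint with a view, and bound exploration with a constraint.
View == <<zabState, acceptedEpoch, history, commitIndex, msgs>>  \* protocol variables (abridged), without the test counters

StateConstraint ==
    /\ electionNum <= MaxElectionNum
    /\ totalRestartNum <= MaxTotalRestartNum

\* Then in the TLC model configuration (.cfg):
\*   VIEW View
\*   CONSTRAINT StateConstraint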

Vanlightly avatar Sep 09 '21 15:09 Vanlightly