
Invariant ``MoreUpToDateCorrectInv`` violated in TLA+ specification

Open heidihoward opened this issue 3 years ago • 9 comments

Describe the bug TLC found a violation of the MoreUpToDateCorrectInv invariant when checking the current TLA+ specification in https://github.com/microsoft/CCF/blob/main/tla/raft_spec/ccfraft.tla.

Note that this is a different TLA+ specification (and thus a different issue) from #3828.

To Reproduce Use TLC to check the current spec in https://github.com/microsoft/CCF/blob/main/tla/raft_spec/MCraft.tla with a term limit of 3. As you can see from the output below, TLC took more than 2 days (after checking over 37 billion states) on a well-resourced machine to find this bug, so I'll need to be patient.

Expected behavior TLC should have completed checking without finding a safety violation.

Environment information Host: Azure standard_HB120rs_v3 OS: Ubuntu 20.04 LTS TLC: TLC2 Version 2.17 of 02 February 2022

Additional context The final output & counter example provided by TLC is copied below:

State 27: <ClientRequest line 351, col 5 to line 362, col 80 of module ccfraft>
/\ ReconfigurationCount = 1
/\ committedLogDecrease = FALSE
/\ messages = { [ mtype |-> RequestVoteRequest,
    mterm |-> 2,
    mlastLogTerm |-> 0,
    mlastLogIndex |-> 0,
    msource |-> NodeTwo,
    mdest |-> NodeThree ] }
/\ clientRequests = 3
/\ matchIndex = ( NodeOne :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeTwo :>
      ( NodeOne :> 2 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeThree :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeFour :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeFive :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) )
/\ messagesSent = ( NodeOne :>
      ( NodeOne :> <<>> @@
        NodeTwo :> <<>> @@
        NodeThree :> <<>> @@
        NodeFour :> <<>> @@
        NodeFive :> <<>> ) @@
  NodeTwo :>
      ( NodeOne :> <<1, 1>> @@
        NodeTwo :> <<>> @@
        NodeThree :> <<>> @@
        NodeFour :> <<>> @@
        NodeFive :> <<>> ) @@
  NodeThree :>
      ( NodeOne :> <<>> @@
        NodeTwo :> <<>> @@
        NodeThree :> <<>> @@
        NodeFour :> <<>> @@
        NodeFive :> <<>> ) @@
  NodeFour :>
      ( NodeOne :> <<>> @@
        NodeTwo :> <<>> @@
        NodeThree :> <<>> @@
        NodeFour :> <<>> @@
        NodeFive :> <<>> ) @@
  NodeFive :>
      ( NodeOne :> <<>> @@
        NodeTwo :> <<>> @@
        NodeThree :> <<>> @@
        NodeFour :> <<>> @@
        NodeFive :> <<>> ) )
/\ log = ( NodeOne :>
      << [term |-> 2, contentType |-> TypeReconfiguration, value |-> {NodeTwo}],
         [term |-> 2, contentType |-> TypeSignature, value |-> 0],
         [term |-> 3, contentType |-> TypeEntry, value |-> 2] >> @@
  NodeTwo :>
      << [term |-> 2, contentType |-> TypeReconfiguration, value |-> {NodeTwo}],
         [term |-> 2, contentType |-> TypeSignature, value |-> 0],
         [term |-> 2, contentType |-> TypeEntry, value |-> 1],
         [term |-> 2, contentType |-> TypeSignature, value |-> 1] >> @@
  NodeThree :> <<>> @@
  NodeFour :> <<>> @@
  NodeFive :> <<>> )
/\ state = ( NodeOne :> Leader @@
  NodeTwo :> Leader @@
  NodeThree :> Follower @@
  NodeFour :> Pending @@
  NodeFive :> Pending )
/\ votesRequested = ( NodeOne :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 1 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeTwo :>
      ( NodeOne :> 1 @@
        NodeTwo :> 0 @@
        NodeThree :> 1 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeThree :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeFour :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeFive :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) )
/\ commitIndex = ( NodeOne :> 0 @@
  NodeTwo :> 4 @@
  NodeThree :> 0 @@
  NodeFour :> 0 @@
  NodeFive :> 0 )
/\ currentTerm = ( NodeOne :> 3 @@
  NodeTwo :> 2 @@
  NodeThree :> 3 @@
  NodeFour :> 1 @@
  NodeFive :> 1 )
/\ committedLog = << [term |-> 2, contentType |-> TypeReconfiguration, value |-> {NodeTwo}],
   [term |-> 2, contentType |-> TypeSignature, value |-> 0],
   [term |-> 2, contentType |-> TypeEntry, value |-> 1],
   [term |-> 2, contentType |-> TypeSignature, value |-> 1] >>
/\ nextIndex = ( NodeOne :>
      ( NodeOne :> 3 @@
        NodeTwo :> 3 @@
        NodeThree :> 3 @@
        NodeFour :> 3 @@
        NodeFive :> 3 ) @@
  NodeTwo :>
      ( NodeOne :> 3 @@
        NodeTwo :> 1 @@
        NodeThree :> 1 @@
        NodeFour :> 1 @@
        NodeFive :> 1 ) @@
  NodeThree :>
      ( NodeOne :> 1 @@
        NodeTwo :> 1 @@
        NodeThree :> 1 @@
        NodeFour :> 1 @@
        NodeFive :> 1 ) @@
  NodeFour :>
      ( NodeOne :> 1 @@
        NodeTwo :> 1 @@
        NodeThree :> 1 @@
        NodeFour :> 1 @@
        NodeFive :> 1 ) @@
  NodeFive :>
      ( NodeOne :> 1 @@
        NodeTwo :> 1 @@
        NodeThree :> 1 @@
        NodeFour :> 1 @@
        NodeFive :> 1 ) )
/\ votesGranted = ( NodeOne :> {NodeOne, NodeThree} @@
  NodeTwo :> {NodeOne, NodeTwo} @@
  NodeThree :> {} @@
  NodeFour :> {} @@
  NodeFive :> {} )
/\ votesSent = ( NodeOne :> TRUE @@
  NodeTwo :> TRUE @@
  NodeThree :> FALSE @@
  NodeFour :> FALSE @@
  NodeFive :> FALSE )
/\ votedFor = ( NodeOne :> NodeOne @@
  NodeTwo :> NodeTwo @@
  NodeThree :> NodeOne @@
  NodeFour :> Nil @@
  NodeFive :> Nil )
/\ Configurations = ( NodeOne :> <<<<0, {NodeOne, NodeTwo, NodeThree}>>, <<1, {NodeTwo}>>>> @@
  NodeTwo :> <<<<1, {NodeTwo}>>>> @@
  NodeThree :> <<<<0, {NodeOne, NodeTwo, NodeThree}>>>> @@
  NodeFour :> <<<<0, {NodeOne, NodeTwo, NodeThree}>>>> @@
  NodeFive :> <<<<0, {NodeOne, NodeTwo, NodeThree}>>>> )
/\ commitsNotified = ( NodeOne :> <<0, 0>> @@
  NodeTwo :> <<0, 0>> @@
  NodeThree :> <<0, 0>> @@
  NodeFour :> <<0, 0>> @@
  NodeFive :> <<0, 0>> )

37797101298 states generated, 8813296222 distinct states found, 4094258023 states left on queue.
The depth of the complete state graph search is 29.
The average outdegree of the complete state graph is 2 (minimum is 0, the maximum 31 and the 95th percentile is 2).
Finished in 2d 05h at (2022-05-12 22:38:16)

heidihoward avatar May 13 '22 14:05 heidihoward

I am probably missing something here, but

/\ votesRequested = ( NodeOne :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 1 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@

suggests node 3 is running for election. But

/\ votesGranted = ( NodeOne :> {NodeOne, NodeThree} @@
  NodeTwo :> {NodeOne, NodeTwo} @@
  NodeThree :> {} @@
  NodeFour :> {} @@
  NodeFive :> {} )

suggests that other nodes are receiving votes!?

achamayou avatar May 13 '22 15:05 achamayou

votesRequested is first indexed by the candidate sending the RequestVote message and then by the node which received the message (it does not count candidates voting for themselves).

So the following state means that node 1 requested a vote from node 3

/\ votesRequested = ( NodeOne :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 1 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@

Likewise, votesGranted is indexed by the candidate to whom the votes were granted, so the following means that node 1 received a vote from itself and from node 3.

/\ votesGranted = ( NodeOne :> {NodeOne, NodeThree} @@

I can update the current documentation of votesRequested to make this clearer. https://github.com/microsoft/CCF/blob/2a048e2ef9da7a5b09e759f232b7abe0fe497e93/tla/raft_spec/ccfraft.tla#L119-L121

I could also add an invariant that checks that if a vote was granted (except by the candidate itself) then it must have been requested.
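That proposed invariant can be sketched outside TLA+ as a simple predicate over the two state variables. This is an illustrative Python sketch only (not part of the spec); the variable names mirror the trace above, and the data below is taken from the counterexample state:

```python
# Sketch (not the spec): a vote granted to a candidate by any node other
# than the candidate itself must correspond to a recorded vote request.
def vote_granted_implies_requested(votes_granted, votes_requested):
    return all(
        voter == candidate or votes_requested[candidate][voter] > 0
        for candidate, voters in votes_granted.items()
        for voter in voters
    )

# Non-empty entries from the counterexample state above.
votes_granted = {
    "NodeOne": {"NodeOne", "NodeThree"},
    "NodeTwo": {"NodeOne", "NodeTwo"},
}
votes_requested = {
    "NodeOne": {"NodeThree": 1},
    "NodeTwo": {"NodeOne": 1, "NodeThree": 1},
}
print(vote_granted_implies_requested(votes_granted, votes_requested))  # True
```

On the counterexample state the predicate holds, so this invariant would not have caught the bug, but it would document the indexing convention.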

heidihoward avatar May 16 '22 08:05 heidihoward

This issue seems to have been caused by a problem with the logic for determining what set of servers constitutes a quorum when a candidate is trying to become a leader whilst having an uncommitted but signed reconfiguration tx in its log.

In this particular case, node 1 becomes leader in term 3 after receiving votes from itself and node 3. However, the first entry in its log is a reconfiguration transaction which changes the set of servers to just node 2. This transaction has been committed by node 2 (and thus the configuration has changed) but node 1 does not know that yet. The full trace is here (ccf-reconfig-bug.txt) if anyone is interested.

I believe a candidate should get a quorum in each of the current configurations before becoming a leader. The TLA+ spec calculates the set of servers in all active configurations and requires a quorum across this set instead. The code is below: https://github.com/microsoft/CCF/blob/e4362abb02fffc088a5710dd37b90b5f8b114a0c/tla/raft_spec/ccfraft.tla#L328-L330

This explains why node 1 was able to become a leader without a vote from node 2. This approach could cause other issues as well. For instance, if a system were switching from nodes 1, 2 and 3 to nodes 4, 5 and 6, then a new leader could be elected with only the support of nodes 3, 4, 5 and 6, leaving a majority quorum (nodes 1 and 2) in the old configuration which might never learn about the new configuration.
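The difference between the two quorum rules can be sketched as follows. This is an illustrative Python sketch under the assumptions above (node labels and configurations are from the example, everything else is hypothetical), not the CCF implementation:

```python
# "across": a single majority over the union of all active configurations.
def quorum_across(votes, configs):
    active = set().union(*configs)
    return len(votes & active) > len(active) // 2

# "each": a majority in every active configuration separately.
def quorum_each(votes, configs):
    return all(len(votes & c) > len(c) // 2 for c in configs)

old, new = {1, 2, 3}, {4, 5, 6}
votes = {3, 4, 5, 6}  # supporters of the would-be leader

print(quorum_across(votes, [old, new]))  # True: 4 of the 6 active nodes
print(quorum_each(votes, [old, new]))    # False: only node 3 from the old config
```

The "across" rule accepts a leader backed by a minority of the old configuration, which is exactly the scenario described above.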

I have some thoughts about how to patch this in the spec, but I would like to discuss it first to work out what CCF is doing in practice.

heidihoward avatar May 25 '22 17:05 heidihoward

@heidihoward the implementation does use a quorum across all active nodes:

  • https://github.com/microsoft/CCF/blob/main/src/consensus/aft/raft.h#L1976
  • https://github.com/microsoft/CCF/blob/main/src/consensus/aft/raft.h#L2130

as opposed to a quorum in each active configuration. I have not been able to find any notes about why that choice was made, but I remember it was discussed at least once.

I suspect MoreUpToDateCorrectInv does not hold with the currently implemented election scheme, which is based on committable watermarks rather than committed watermarks: https://github.com/microsoft/CCF/blob/main/src/consensus/aft/raft.h#L1642

The best entry point for history on why this change was made is probably #589.

achamayou avatar May 25 '22 20:05 achamayou

So I think there are two separate issues here which are worth considering separately:

  1. I am curious about the idea of quorums across all active configurations instead of a quorum in each active configuration. It is not obvious to me why this is safe. As demonstrated by node 1 in the example, it violates the constraint that if a node becomes a leader in some term t, then its log contains all entries which were committed in terms <t. This constraint is useful as it is the basis for why it is safe for followers to overwrite log entries when told to do so by the leader (as the leader already knows all the committed entries).
  2. I agree that MoreUpToDateCorrectInv doesn't match the current election scheme. In my mind, txs after the last signed tx are provisional and are ignored for the purpose of leader election. This invariant can be fixed by using the last signed index instead of the last index.
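The "last signed index" idea in point 2 can be sketched as follows, reusing the entry field names from the trace above. This is an illustrative Python sketch, not the implementation:

```python
# Sketch: for election comparisons, only the log prefix up to and including
# the last signature entry counts; anything after it is provisional.
def last_signed_index(log):
    """1-based index of the last TypeSignature entry, or 0 if none."""
    for i in range(len(log), 0, -1):
        if log[i - 1]["contentType"] == "TypeSignature":
            return i
    return 0

# NodeOne's log from the counterexample state.
log_node_one = [
    {"term": 2, "contentType": "TypeReconfiguration"},
    {"term": 2, "contentType": "TypeSignature"},
    {"term": 3, "contentType": "TypeEntry"},  # provisional: after the last signature
]
print(last_signed_index(log_node_one))  # 2
```

Under this reading, NodeOne's term-3 entry would be ignored when comparing logs during elections.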

heidihoward avatar May 26 '22 13:05 heidihoward

Where C represents the commit level (last commit_idx received from primary) and c represents the committable level (last_idx containing a valid signature from the primary).

R represents a reconfiguration transaction from (0, 1, 2) to (2).

Node 2 was primary, until a partition occurs.

Node 0: |--------C
Node 1: |--------C-R-c
--------------------------------------
Node 2: |----------R-C----c

In the "each" election mode, Node 2 easily elects itself and continues. Node 0 and Node 1 continuously run elections, unsuccessfully, since they can never get a majority in (2).

In the "across" election mode, Node 2 also elects itself and continues. But Node 0 and Node 1 elect Node 1, which replicates its log. R commits, which signals the removal of Node 0 and Node 1. Retirement details.

We think the latter is preferable, because:

  1. It does not seem to create a possibility for a fork: R is committable, so it will eventually commit, and the minority partition will shut down once it is committed there.
  2. The minority partition does catch up. If Node 2 were to crash immediately post-partition, after having advertised a commit level further than Node 0 and Node 1 have advertised, the minority partition would never reach it under "each" election, but with "across" it will.
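The partition scenario above can be checked against the two quorum rules. This is an illustrative Python sketch (node labels from the diagram; after R becomes committable, the active configurations on the minority side are the old (0, 1, 2) and the new (2)):

```python
# "across": a single majority over the union of all active configurations.
def quorum_across(votes, configs):
    active = set().union(*configs)
    return len(votes & active) > len(active) // 2

# "each": a majority in every active configuration separately.
def quorum_each(votes, configs):
    return all(len(votes & c) > len(c) // 2 for c in configs)

configs = [{0, 1, 2}, {2}]
minority = {0, 1}  # Node 0 and Node 1, partitioned away from Node 2

print(quorum_each(minority, configs))    # False: no majority in {2} is possible
print(quorum_across(minority, configs))  # True: 2 of the 3 active nodes
```

This is why the minority partition stalls forever under "each" but can elect Node 1 and catch up under "across".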

Edit: while I think this example is useful in illustrating why it seems good for elections to work this way, it does not illustrate a breach of MoreUpToDateCorrectInv as I understand it, because both post-partition primaries have in their respective logs all the entries that their electors believe are committed. So I need to go read the trace again.

achamayou avatar May 26 '22 14:05 achamayou

Ok, so it seems that:

/\ log = ( NodeOne :>
      << [term |-> 2, contentType |-> TypeReconfiguration, value |-> {NodeTwo}],
         [term |-> 2, contentType |-> TypeSignature, value |-> 0],
         [term |-> 3, contentType |-> TypeEntry, value |-> 2] >> @@

despite:

/\ Configurations = ( NodeOne :> <<<<0, {NodeOne, NodeTwo, NodeThree}>>, <<1, {NodeTwo}>>>> @@

Node 1 is emitting an Entry after observing its reconfiguration becoming committable! That can't happen in the current implementation: https://github.com/microsoft/CCF/blob/main/src/consensus/aft/raft.h#L251

achamayou avatar May 26 '22 15:05 achamayou

I think at the heart of this discussion is the definition of committable. My understanding is that a committable transaction (one that is followed by a signed transaction) is not guaranteed to eventually be committed (and is thus not guaranteed to be on a majority of replicas). This differs from your claim that if R is committable then it will eventually happen.

I think the problem with quorums across committable configurations is that a leader can be elected without the support of a majority quorum in the current configuration. Consider the set of replicas {n0,n1,n2}:

  1. n0 adds a reconfiguration tx for switching to the configuration {n0,n3,n4}, followed by a signature.
  2. n0 stops being the leader before replicating this reconfiguration tx to n1 or n2
  3. n0 stands for election again. n0 now believes that it can become a leader with just the support of itself, n3 and n4 (as the full set of active nodes is {n0,n1,n2,n3,n4}). In other words, n0 can become a leader without a majority quorum in the old configuration {n0,n1,n2}.
  4. Meanwhile, n1 and n2 (which have no knowledge of the reconfiguration) elect node n1 to be the leader in the same term.
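The split-brain in the steps above can be sketched against the "across" quorum rule. This is an illustrative Python sketch (node names from the comment, everything else hypothetical):

```python
# "across": a single majority over the union of all active configurations.
def quorum_across(votes, configs):
    active = set().union(*configs)
    return len(votes & active) > len(active) // 2

# n0 knows of the (uncommitted but signed) reconfiguration to {n0, n3, n4}.
n0_configs = [{"n0", "n1", "n2"}, {"n0", "n3", "n4"}]
print(quorum_across({"n0", "n3", "n4"}, n0_configs))  # True: 3 of the 5 active nodes

# n1 and n2 have never seen the reconfiguration tx.
n1_configs = [{"n0", "n1", "n2"}]
print(quorum_across({"n1", "n2"}, n1_configs))        # True: 2 of 3

# Both checks pass, so n0 and n1 can each become leader in the same term:
# two leaders, neither backed by a majority known to the other.
```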

heidihoward avatar May 27 '22 11:05 heidihoward

I think at the heart of this discussion is the definition of committable. My understanding is that a committable transaction (one that is followed by a signed transaction) is not guaranteed to eventually be committed (and is thus not guaranteed to be on a majority of replicas).

Yes, that's correct.

This differs from your claim that if R is committable then it will eventually happen.

That claim on its own is definitely incorrect. I think it holds in the context of the example I gave, but I agree not in general, and in particular not in the example you have given.

As discussed, this has unpleasant implications for convergence back to a previously advertised commit in at least some scenarios, like the one I gave in my example. Clearly though, ending up with two actual leaders (not just a caretaker and a real leader) is much worse!

achamayou avatar May 27 '22 12:05 achamayou

Fixed by https://github.com/microsoft/CCF/pull/3965

heidihoward avatar Oct 12 '22 08:10 heidihoward