ouroboros-network icon indicating copy to clipboard operation
ouroboros-network copied to clipboard

Re-enable the ThreadNet tests in the UTxO-HD feature branch

Open jasagredo opened this issue 1 year ago • 4 comments

There are several errors related to the ThreadNet tests and therefore we disabled them. They should be re-enabled. We list here some errors we have found, just so that they are checked (their quickcheck seed):

FAIL (0.02s)
      ,*** Failed! Falsified (after 1 test):
      params: PBftParams {pbftSecurityParam = SecurityParam 9, pbftNumNodes = NumCoreNodes 3, pbftSignatureThreshold = PBftSignatureThreshold {getPBftSignatureThreshold = 0.4445444444444444}}
      Ref.PBFT result: Outcomes [Nominal]
      delegation certificates: [(CoreId (CoreNodeId 0),[]),(CoreId (CoreNodeId 1),[]),(CoreId (CoreNodeId 2),[])]
      [(CoreId (CoreNodeId 0),SoftwareVersionUpdateLabel {svuObserved = False, svuRequired = Just True}),(CoreId (CoreNodeId 1),SoftwareVersionUpdateLabel {svuObserved = False, svuRequired = Just True}),(CoreId (CoreNodeId 2),SoftwareVersionUpdateLabel {svuObserved = False, svuRequired = Just True})]
      (CoreId (CoreNodeId 0),SoftwareVersionUpdateLabel {svuObserved = False, svuRequired = Just True})
      Use --quickcheck-replay=79593 to reproduce.
      Use -p '/mkUpdateLabels anticipates instant confirmation/' to rerun this test only.

jasagredo avatar Sep 08 '22 10:09 jasagredo

The narrative of the test that I'm getting from tracing and inspecting the code.

  • This test starts 3 nodes which are expected to make a software and protocol update immediately.
  • There exists the concept of crucial txs which will be txs that will be added at every slot and every change of the ledger etc.
  • For the CoreNode 0, the crucial txs are:
    • UpdateProposal 8d148b6272fba556
    • UpdateVote { voter: 911965b94fe20652, proposal id: 8d148b6272fba556 }
  • For the CoreNode 1 the crucial txs are: UpdateVote { voter: d01c07ca8efdb443, proposal id: 8d148b6272fba556 } which fails to be added because the proposal is not yet known by the node 1.
  • For the CoreNode 2 the crucial txs are: UpdateVote { voter: b61a2dcc99c5ef53, proposal id: 8d148b6272fba556 } which fails to be added because the proposal is not yet known by the node 2.

This is the initial setup, now how the test progresses:

  1. CoreNode 0 forges a block at slot 0.
  2. At each slot, the crucial txs will be repeated. In particular CoreNode 0 claims several times that the proposal is duplicated and the vote already cast.
  3. The nodes connect with MsgInit and start chainsync, finding intersections at EBB 88900fba40903da0b4 (not genesis because on ChainSync it is treated different than Origin) and the node 0 claims to have also a point at Slot 0, hash 82f4e261cb1024469d645
  4. Nodes ask for txs to each other, node 0 answers with MsgReplyTxIds (BlockingReply ((upid:8d148b62,162) :| [(voteid: a878dba3,168)]))
  5. ChainSync stabilizes in all directions, (with MsgRollBack at the beginning etc) all to the Genesis block.
  6. 0 -> 1 MsgRollForward sending the block mentioned in (3)
  7. 1 <-> 2 they share their UpdateVote transactions
  8. 0 -> 2 MsgRollForward sending the block mentioned in (3)
  9. Nodes 1 and 2 succesfully adds its crucial tx. From this point on, they say that the vote was already cast.
  10. Nodes 1 and 2 request the body of 82f4e261cb1024469d645, they adopt it and move forward their chainsync servers

Now the test reaches the end and the final ledgers look like this:

cvsUpdateState = State {
    currentEpoch = EpochNumber {getEpochNumber = 0}
    , adoptedProtocolVersion = 0.0.0
    , adoptedProtocolParameters = ProtocolParameters {ppScriptVersion = 0, ppSlotDuration = 1000, ppMaxBlockSize = 2000000, ppMaxHeaderSize = 2000000, ppMaxTxSize = 8192, ppMaxProposalSize = 700, ppMpcThd = LovelacePortion 10000000000000, ppHeavyDelThd = LovelacePortion 5000000000000, ppUpdateVoteThd = LovelacePortion 1000000000000, ppUpdateProposalThd = LovelacePortion 100000000000000, ppUpdateProposalTTL = SlotNumber {unSlotNumber = 10}, ppSoftforkRule = SoftforkRule {srInitThd = LovelacePortion 900000000000000, srMinThd = LovelacePortion 600000000000000, srThdDecrement = LovelacePortion 50000000000000}, ppTxFeePolicy = TxFeePolicyTxSizeLinear (TxSizeLinear (Lovelace 155381) (21973 % 500)), ppUnlockStakeEpoch = EpochNumber {getEpochNumber = 18446744073709551615}}
    , candidateProtocolUpdates = []
    , appVersions = fromList []
    , registeredProtocolUpdateProposals = fromList []
    , registeredSoftwareUpdateProposals = fromList []
    , confirmedProposals = fromList []
    , proposalVotes = fromList []
    , registeredEndorsements = fromList []
    , proposalRegistrationSlot = fromList []
    }

And this is wrong, the appVersions should contain:

theProposedSoftwareVersion :: Update.SoftwareVersion
theProposedSoftwareVersion = Update.SoftwareVersion
  -- appnames must be ASCII and <= 12 characters
  (Update.ApplicationName "Dummy")
  0

The sequence of transactions propagated is as follows, all this before the block produced by 0 is propagated:

  • 1 requests from 0 proposal and vote of 0
  • 2 requests from 0 proposal and vote of 0
  • 2 requests from 1 proposal (why?)
  • 0 requests from 1 vote of 1
  • 2 requests from 1 vote of 1
  • 1 requests from 2 vote of 2
  • 0 requests from 2 vote of 2
  • 2 requests from 0 vote of 1 (why?)

Now the questions are:

  • As nodes 1 and 2 no longer complain about the votes, it means they indeed registered the update. Most likely because they were included in the forged block.
  • What is it needed to confirm the proposal? The nodes seem to register the following update: Register update: (Nothing,[],Endorsement {endorsementProtocolVersion = 1.0.0, endorsementKeyHash = KeyHash {unKeyHash = 911965b94fe206522fe8fb1683abf0bd39d9f092e9c9553ddbe35ae0}}) so they all register that node 0 endorses the update.

jasagredo avatar Sep 08 '22 11:09 jasagredo

So it seems that the flow is as follows:

When evaluating a block with the updateBody function (which is the BBODY rule), we register the updates. The update proposal comes from updateProposal = Update.payloadProposal $ blockUpdatePayload b so it has to be in the block, however, the block that is sent via BlockFetch has nothing in that field: bodyUpdatePayload = APayload {payloadProposal = Nothing, payloadVotes = [], payloadAnnotation = "\130\128\159\255"}}. That's why the proposal is not registered and then not observed.

So the question now boils down to "why node 0 creates a block and doesn't include the proposal"

jasagredo avatar Sep 08 '22 12:09 jasagredo

Transactions are accepted because they produce no errors, and yet, if we query for the isTxs of the mempool accessible from the node kernel when forging, it returns txs in mempool: []. Some TVar is not being updated somewhere

jasagredo avatar Sep 08 '22 12:09 jasagredo

On our sync call today, we think we determined why the ThreadNet tests have been failing. The existing ThreadNet logic that ensure the forge thread waits until the crucial txs have been added was assuming a specific forge thread event happened before the mempool was read. Our UtxoHd refactor of the forge thread, though, made it read the mempool earlier. So if we update the synchronization logic to use the correct forge thread event (the earlier event that is now the last event it emits before reading the mempool), then the tests passed.

nfrisby avatar Sep 08 '22 15:09 nfrisby

After merging the mempool rewrite (#4035), there is one test remaining, the threadnet tests of ouroboros-consensus-cardano-test. We need to look at it before closing this issue.

jasagredo avatar Oct 20 '22 08:10 jasagredo