polkadot
polkadot copied to clipboard
Batch vote import in dispute-distribution
WIP. Resolves #5881
- [x] Guide Update
- [x] Sending side
- [x] Receiving side
- [x] Tests
- [x] Burnin on Versi
- [x] Stress testing on Versi to see actual improvements - summary on this PR/or ticket.
Especially check on the rate of dispute imports happening.
Note to reviewers: Start with the guide changes as those explain what this PR is about and the whys.
Beginner's question here. Are all Polkadot relay chain disputes handled by the same dispute distribution system? For instance disputes for normal equivocation in relay chain blocks as well as disputes caused by the backing/approvals process for parachain blocks. It seems like maybe this system is only meant for POV based on mentions of backing and availability.
Also, should I flag the gramatical errors I'm spotting in dispute-distribution.md? If so, should that take the form of comments here?
In dispute-distribution.md I had trouble parsing the sentence "To determine whether a dispute is still alive we will issue a DisputeCoordinatorMessage::ActiveDisputes message before each retry run." Is there a triggered response to this ActiveDisputes message which tells us whether the dispute is still active? Or does this work in another way?
In response to the following sentence from dispute-distribution.md "Upon receiving a DisputeRequest message, a dispute distribution will trigger the import of the received votes via the dispute coordinator (DisputeCoordinatorMessage::ImportStatements)"
I'm not totally certain what the process for importing a vote is. Is this about right?
The DisputeCoordinator which lives in the client receives P2P message containing a vote. The DisputeCoordinator then makes a call/extrinsic for the runtime to include the received vote in a block.
In response to the following sentence from dispute-distribution.md
"All those disputes are valid ones and will result in slashing. Let's assume all of them conclude valid, and we slash 1% on those. This will still mean that nodes get slashed 100% in just 20 seconds."
This was hard for me to understand. Is there a more long winded explanation to make this more clear?
We only have disputes for parachain block validity. All our other slashes are simply factual, meaning anyone can slash you by placing your contradicting signatures on-chain, no opinions required. We've votes here because two different validators express different options, instead of the same node differing with itself like everywhere else. We do however obtain a 2/3 majority when finalizing the block that slashes, but this occurs for "safety" aka agreement, not "soundness" ala validity.
In response to the following sentence from dispute-distribution.md
"All those disputes are valid ones and will result in slashing. Let's assume all of them conclude valid, and we slash 1% on those. This will still mean that nodes get slashed 100% in just 20 seconds."
This was hard for me to understand. Is there a more long winded explanation to make this more clear?
Reading the dispute-coordinator section in the guide will likely help a bit with understanding. For the actual slashing strategies, I hope @ordian has appropriate guide changes as well. It seems that slashes are actually not implemented that way at the moment.
In general: If a dispute concludes valid - meaning, a majority of validators said a candidate is valid, then all validators who voted "invalid" will get slashed, but only a "small" amount - like 1% of their slash. Reasoning: They did not try to get anything bad in, they just wasted everybody's time.
In dispute-distribution.md I had trouble parsing the sentence "To determine whether a dispute is still alive we will issue a DisputeCoordinatorMessage::ActiveDisputes message before each retry run." Is there a triggered response to this ActiveDisputes message which tells us whether the dispute is still active? Or does this work in another way?
Clarified.
If so, should that take the form of comments here?
Best to start a proper review - keeps things more organized.
Stress test results:

Queries:
sum(rate(polkadot_parachain_dispute_distribution_imported_requests{chain="versi_v1_9", pod="versi-validator-a-node-19"}[1h])) by (success)
sum(rate(polkadot_parachain_dispute_distribution_received_requests{chain="versi_v1_9", pod="versi-validator-a-node-19"}[1h]))
We can see successful imports maxing out at around 200/s and failing imports at around 500/s. Received requests are around 1000/s.

Successful import rate maxing out at 346/s - and stable at that, error rate is 0. Received requests around 1700/s.
These tests have been performed under extreme conditions, a better comparison would very likely be possible at less extreme conditions (nodes have been crashing), but it becomes apparent already that this PR improves import performance significantly. I would expect the improvement to be even higher under less extreme conditions as with nodes crashing and very high ToFs/signals queuing, ... ordering of disputes becomes less effective which also affects the efficiency of batching. Nodes were still processing a backlog of disputes hours after we disabled malicious nodes! I also saw nodes retrying sending, because they would not receive a response in a timely manner, because of the high ToFs of the dispute-coordinator, which further decreases efficiency.
Another data point to understand how heavy the load was: I saw for example that from a particular candidate inclusion event to triggering of a dispute 8 minutes have passed.
@ordian Can we get this merged? sr-lab check is requested, but given the current state of disputes - it is better to merge than not to. The old distribution was relying on assumptions no longer holding true.
bot merge
Waiting for commit status.
Merge cancelled due to error. Error: Statuses failed for f0a054b71012b1d4119e41fde17d3d1c3b0d4521