polkadot icon indicating copy to clipboard operation
polkadot copied to clipboard

Batch vote import in dispute-distribution

Open eskimor opened this issue 3 years ago • 10 comments

WIP. Resolves #5881

  • [x] Guide Update
  • [x] Sending side
  • [x] Receiving side
  • [x] Tests
  • [x] Burnin on Versi
  • [x] Stress testing on Versi to see actual improvements - summary on this PR/or ticket.

Especially check on the rate of dispute imports happening.

Note to reviewers: Start with the guide changes as those explain what this PR is about and the whys.

eskimor avatar Aug 17 '22 10:08 eskimor

Beginner's question here. Are all Polkadot relay chain disputes handled by the same dispute distribution system? For instance disputes for normal equivocation in relay chain blocks as well as disputes caused by the backing/approvals process for parachain blocks. It seems like maybe this system is only meant for POV based on mentions of backing and availability.

BradleyOlson64 avatar Aug 19 '22 23:08 BradleyOlson64

Also, should I flag the gramatical errors I'm spotting in dispute-distribution.md? If so, should that take the form of comments here?

BradleyOlson64 avatar Aug 20 '22 00:08 BradleyOlson64

In dispute-distribution.md I had trouble parsing the sentence "To determine whether a dispute is still alive we will issue a DisputeCoordinatorMessage::ActiveDisputes message before each retry run." Is there a triggered response to this ActiveDisputes message which tells us whether the dispute is still active? Or does this work in another way?

BradleyOlson64 avatar Aug 20 '22 01:08 BradleyOlson64

In response to the following sentence from dispute-distribution.md "Upon receiving a DisputeRequest message, a dispute distribution will trigger the import of the received votes via the dispute coordinator (DisputeCoordinatorMessage::ImportStatements)"

I'm not totally certain what the process for importing a vote is. Is this about right?

The DisputeCoordinator which lives in the client receives P2P message containing a vote. The DisputeCoordinator then makes a call/extrinsic for the runtime to include the received vote in a block.

BradleyOlson64 avatar Aug 20 '22 01:08 BradleyOlson64

In response to the following sentence from dispute-distribution.md

"All those disputes are valid ones and will result in slashing. Let's assume all of them conclude valid, and we slash 1% on those. This will still mean that nodes get slashed 100% in just 20 seconds."

This was hard for me to understand. Is there a more long winded explanation to make this more clear?

BradleyOlson64 avatar Aug 20 '22 01:08 BradleyOlson64

We only have disputes for parachain block validity. All our other slashes are simply factual, meaning anyone can slash you by placing your contradicting signatures on-chain, no opinions required. We've votes here because two different validators express different options, instead of the same node differing with itself like everywhere else. We do however obtain a 2/3 majority when finalizing the block that slashes, but this occurs for "safety" aka agreement, not "soundness" ala validity.

burdges avatar Aug 20 '22 14:08 burdges

In response to the following sentence from dispute-distribution.md

"All those disputes are valid ones and will result in slashing. Let's assume all of them conclude valid, and we slash 1% on those. This will still mean that nodes get slashed 100% in just 20 seconds."

This was hard for me to understand. Is there a more long winded explanation to make this more clear?

Reading the dispute-coordinator section in the guide will likely help a bit with understanding. For the actual slashing strategies, I hope @ordian has appropriate guide changes as well. It seems that slashes are actually not implemented that way at the moment.

In general: If a dispute concludes valid - meaning, a majority of validators said a candidate is valid, then all validators who voted "invalid" will get slashed, but only a "small" amount - like 1% of their slash. Reasoning: They did not try to get anything bad in, they just wasted everybody's time.

eskimor avatar Aug 22 '22 10:08 eskimor

In dispute-distribution.md I had trouble parsing the sentence "To determine whether a dispute is still alive we will issue a DisputeCoordinatorMessage::ActiveDisputes message before each retry run." Is there a triggered response to this ActiveDisputes message which tells us whether the dispute is still active? Or does this work in another way?

Clarified.

eskimor avatar Aug 22 '22 10:08 eskimor

If so, should that take the form of comments here?

Best to start a proper review - keeps things more organized.

eskimor avatar Aug 22 '22 10:08 eskimor

Stress test results:

Imports without this PR:

Screenshot from 2022-09-15 13-37-53

Queries:

sum(rate(polkadot_parachain_dispute_distribution_imported_requests{chain="versi_v1_9", pod="versi-validator-a-node-19"}[1h])) by (success)

sum(rate(polkadot_parachain_dispute_distribution_received_requests{chain="versi_v1_9", pod="versi-validator-a-node-19"}[1h]))

We can see successful imports maxing out at around 200/s and failing imports at around 500/s. Received requests are around 1000/s.

Situation with this PR:

Screenshot from 2022-09-15 13-43-36

Successful import rate maxing out at 346/s - and stable at that, error rate is 0. Received requests around 1700/s.

These tests have been performed under extreme conditions, a better comparison would very likely be possible at less extreme conditions (nodes have been crashing), but it becomes apparent already that this PR improves import performance significantly. I would expect the improvement to be even higher under less extreme conditions as with nodes crashing and very high ToFs/signals queuing, ... ordering of disputes becomes less effective which also affects the efficiency of batching. Nodes were still processing a backlog of disputes hours after we disabled malicious nodes! I also saw nodes retrying sending, because they would not receive a response in a timely manner, because of the high ToFs of the dispute-coordinator, which further decreases efficiency.

Another data point to understand how heavy the load was: I saw for example that from a particular candidate inclusion event to triggering of a dispute 8 minutes have passed.

eskimor avatar Sep 15 '22 11:09 eskimor

@ordian Can we get this merged? sr-lab check is requested, but given the current state of disputes - it is better to merge than not to. The old distribution was relying on assumptions no longer holding true.

eskimor avatar Oct 03 '22 09:10 eskimor

bot merge

eskimor avatar Oct 04 '22 12:10 eskimor

Waiting for commit status.

Merge cancelled due to error. Error: Statuses failed for f0a054b71012b1d4119e41fde17d3d1c3b0d4521

Need to teach the dictionary a couple of words

I meant it should have been fixed here :sweat_smile:

ordian avatar Oct 04 '22 16:10 ordian