polkadot
polkadot copied to clipboard
runtime/disputes: slashing
This PR implements runtime logic for dispute slashing.
- Validator who voted that a valid (decided by a supermajority) candidate (parachain block) is invalid will be slashed 1% of their stake (including nominators). This can happen to very slow validators that timeout during a PVF execution.
- Validator who voted that a invalid (decided by a supermajority) candidate (parachain block) is valid will be slashed 100% of their stake (including nominators) and kicked out of the validator set. This can happen to malicious validators or validators that are >3x faster than the supermajority (we have 2s timeout for backing and 6s for approval-voting).
Implementation
The implementation uses the offences pallet and looks like a hybrid of im-online and grandpa slashing impls.
Meaning, we submit offences for the concluded disputes about the current session candidate directly from the runtime.
If, however, the dispute is about past session, we record pending slashes on chain, without FullIdentification of the offenders.
Later on, a block producer can submit an unsigned transaction with KeyOwnershipProof of an offender and submit it to the runtime to produce an offence.
The reason for this separation is that even though it's currently technically possible to get FullIdentification of the past session validators, we don't want to rely on this information to be available on chain for the past sessions because it's heavy.
Open questions:
- [x] Initial numbers for slashing. Seems like 100% and 1% is OK.
- [x] Do we need exponential slashing? Punt on for now. Might not be needed in the future.
- [x] Slashing for inconclusive disputes. Punt on for now.
- [x] How do we want disabling to work exactly. Using
type Disabledfrom session is probably fine.
TODOs:
- [x] resolve open questions
- [x] fix remaining TODOs in code
- [x] simplify the traits/generics
- [x] zombienet test (only a check for offences)
- [x] benchmarks/weights
Follow-up work:
- staging APIs for disabled validators and pending slashes
- client changes to submit unsigned transactions
Closes #3161.
Do we need exponential slashing?
From an implementation complexity side, it would probably be easier without. 1% is already a pretty high deterrent and client bugs are not unlikely at this stage. From a theoretical perspective we'd probably want exponential slashing kicking in after a certain threshold % of validators and maxing out at 100% at around 1/3-1 of validators, because that can actually pose a risk to finality. In the near term I don't think it matters much.
How do we want disabling to work exactly?
Disabling should apply to validators who vote for invalid blocks. When a validator advocates an invalid block within an era, other validators should ignore all backing messages originating by them for the rest of the era. We should never disable more than f validators within an era and should use a ring-buffer to accomplish that. This means that validators:
- Get to try to attack the network only a small number of times per day (as many backed candidates as they can get in before getting slashed)
- Succeed only with 1/N probability, where N is a large number like 10 milliion or 1 billion.
- Get slashed 100% every time they fail.
Thank you for the feedback.
From an implementation complexity side, it would probably be easier without.
It doesn't have to be exponential, we can do simply multiplier * base_slash (1%), where multiplier is capped at e.g. 5, at which point we also disable a validator. The main concern here is that we seem to slash a max and not a sum for a period of time if I'm not mistaken, so additional misbehaviors would "free" if we don't disable or escalate.
When a validator advocates an invalid block within an era, other validators should ignore all backing messages originating by them for the rest of the era.
That would require keeping track of session key rotations. For starters, I'd implement ignoring for the rest of the session. They will be kicked out of the validator set in the next session anyway.
Is it okay to use type DisableValidators from Session that would include other slashes (GRANDPA and BABE), or do we want exclusively disputes here?
https://github.com/paritytech/polkadot/blob/c81fb04560c843fdc58892f663a111fcdd314b97/runtime/kusama/src/lib.rs#L270
We should never disable more than f validators within an era and should use a ring-buffer to accomplish that.
If I'm not mistaken staking/offences forces new era if enough validators are disabled: https://github.com/paritytech/substrate/pull/9448.
For starters, I'd implement ignoring for the rest of the session. They will be kicked out of the validator set in the next session anyway.
Yes, this seems fine. It only makes the attack ~4x less expensive which still leads to gambler's ruin quite quickly.
Is it okay to use type DisableValidators from Session that would include other slashes (GRANDPA and BABE), or do we want exclusively disputes here?
Including other slashes is fine as long as the same principle applies: no more than 1/3 of validators may be disabled at any time.
Note that in the current form it doesn't enable slashing on Kusama, only on Westend. Still waiting for reviews.
I believe all feedback apart from HostConfiguration configuration and bounded cleanup is addressed. Both of these can be addressed as a follow-up. Please take another look.
The rewards are implemented in #5862
/cmd queue -c bench-bot $ pallet westend-dev runtime_parachains::disputes::slashing
@ordian https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/1793574 was started for your command "$PIPELINE_SCRIPTS_DIR/bench-bot.sh" pallet westend-dev runtime_parachains::disputes::slashing. Check out https://gitlab.parity.io/parity/mirrors/polkadot/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.
Comment /cmd cancel 37-fdaf9945-ccd7-48e0-81b2-ba949f0ac651 to cancel this command or /cmd cancel to cancel all commands in this pull request.
@ordian Command "$PIPELINE_SCRIPTS_DIR/bench-bot.sh" pallet westend-dev runtime_parachains::disputes::slashing has finished. Result: https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/1793574 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/1793574/artifacts/download.
/cmd queue -c bench-bot $ runtime westend-dev runtime_parachains::disputes::slashing
@ordian https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/1793608 was started for your command "$PIPELINE_SCRIPTS_DIR/bench-bot.sh" runtime westend-dev runtime_parachains::disputes::slashing. Check out https://gitlab.parity.io/parity/mirrors/polkadot/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.
Comment /cmd cancel 38-be92d727-9197-4be7-8ab6-c109f696d545 to cancel this command or /cmd cancel to cancel all commands in this pull request.
@ordian Command "$PIPELINE_SCRIPTS_DIR/bench-bot.sh" runtime westend-dev runtime_parachains::disputes::slashing has finished. Result: https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/1793608 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/1793608/artifacts/download.
/cmd queue -c bench-bot $ runtime westend-dev runtime_parachains::disputes::slashing
@ordian https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/1793759 was started for your command "$PIPELINE_SCRIPTS_DIR/bench-bot.sh" runtime westend-dev runtime_parachains::disputes::slashing. Check out https://gitlab.parity.io/parity/mirrors/polkadot/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.
Comment /cmd cancel 39-5a721570-7a10-410e-a243-2379d20c1196 to cancel this command or /cmd cancel to cancel all commands in this pull request.
@ordian Command "$PIPELINE_SCRIPTS_DIR/bench-bot.sh" runtime westend-dev runtime_parachains::disputes::slashing has finished. Result: https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/1793759 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/1793759/artifacts/download.
With regards to slash size: I am thinking more and more that a more soft start might make sense: For wasting resources, the most important thing is to disable validators so this cannot result in a DoS.
What about reducing the against valid slash to just one permille for now? For honest validators that slash would be high enough (I think any slash actually would accomplish that) to act and fix any issues on your end, but low enough to not get too mad. Given that you also lose nominators, it would already be pretty bad. So my plan would be:
- longer timeouts
- figuring out better what is actually happening on those disputes
- have no disputes on Kusama for a while (at least not from validators that can be reached)
- lower against valid slashes
- launch (soft) -> get any remaining validators to act as they are losing money
Resoning: Rewards are around 15% per year, hitting a timeout a single time over the course of an entire year would reduce that by 1% - actually more if you lose nominators (is that really the case, I just read that on some channel) and get removed from the active set.
I think at this still early stage of Kusama and Polkadot those lowered slashes would make a lot of sense.
Yes, we can run a while with just disabling, and then bring in the slashing very slowly. We can also just plan to refund the first say five slashes after the validator reports their hardware configuration, assuming it's a better hardware config than the previous guy who got slashed.
Also, we researchers suggested significantly more than just this doubling of the delay in our analysis in https://github.com/paritytech/polkadot/pull/1656 like 6 x I think.
We should avoid the need for that once the deterministic metering works, but just fyi..
What about reducing the against valid slash to just one permille for now?
Assuming you mean 0.1%, I've changed that. Any other changes you'd like to see in this PR? The plan makes sense to me :+1:
With https://github.com/paritytech/polkadot/pull/5951 we will have that 6x. 2s on backing 12s in approvals.
Another change: 7bf295a cc @eskimor.
due to lacking locks-review this will slip to the next release
We've this intermediate "valid but too slow" mode discussed in https://github.com/paritytech/polkadot/pull/1656
I presume the 6x in https://github.com/paritytech/polkadot/pull/5951 does not bring this distinction in?
@AlistairStewart and I were recently leaning towards still accepting "valid but too slow" blocks and not forking the relay chain to kill them. Imho, similar reasoning suggests any "valid but too slow" logic could be omitted for a while, in hopes Sergey's deterministic metering pans out, but if our current problems are bad enough then maybe we need something of this "valid but too slow" logic.
I just wanted to bring up this subtlety..
We've this intermediate "valid but too slow" mode discussed in #1656
I presume the 6x in #5951 does not bring this distinction in?
@AlistairStewart and I were recently leaning towards still accepting "valid but too slow" blocks and not forking the relay chain to kill them. Imho, similar reasoning suggests any "valid but too slow" logic could be omitted for a while, in hopes Sergey's deterministic metering pans out, but if our current problems are bad enough then maybe we need something of this "valid but too slow" logic.
I just wanted to bring up this subtlety..
If by "valid but too slow" you mean candidates that takes to validate somewhere in between 2s and 12s, we currently accept them. Backers who fail to validate a candidate won't initiate a dispute.
If by "valid but too slow" you mean candidates that takes to validate somewhere in between 2s and 12s, we currently accept them. Backers who fail to validate a candidate won't initiate a dispute.
Yes exactly. I'd think backers can simply give up after 2s without ever initiating disputes.
I asked more about approval checkers, who presumably run the full 12s but what do they do if it takes more than 2s?
I asked more about approval checkers, who presumably run the full 12s but what do they do if it takes more than 2s?
Yes, the approval checkers approve every candidate that validates under 12s (after #5951, so after most validators upgrade, 6s as of now).
Cool thanks. We can discuss "valid by slow" messages by approval checkers if https://github.com/paritytech/polkadot/pull/5951 does not suffice for the current problems, but no point worrying just yet. :)
bot merge
Waiting for commit status.
Merge cancelled due to error. Error: Statuses failed for b453b1af5abb6504e0f8eb95fbbf549e8a9f6418