disputes: implement validator disabling

Open ordian opened this issue 3 years ago • 50 comments

Once a dispute is concluded and an offence is submitted with DisableStrategy::Always, a validator will be added to DisabledValidators list.

Implement on-chain and off-chain logic to ignore dispute votes for X sessions. Optionally, we can ignore backing and approval votes and remove from the reserved validator set on the network level.

Possibly outdated: https://github.com/paritytech/polkadot-sdk/issues/785

⚠️ FOR THE MOST UP TO DATE INFO REFER TO: Disabling Project Board ⚠️

  • [x] Gather requirements across stack
  • [x] Design unified disabling strategy (https://github.com/paritytech/polkadot-sdk/issues/1961)
  • [x] Track disabled validators #2950
  • [x] Read disabled status from the runtime and apply on node side, for ignoring votes for example in dispute-coordinator.
    • [x] https://github.com/paritytech/polkadot-sdk/issues/1591 https://github.com/paritytech/polkadot-sdk/pull/1841
    • [x] https://github.com/paritytech/polkadot-sdk/issues/1592 https://github.com/paritytech/polkadot-sdk/pull/1863
    • [x] dispute-coordinator - #2225
  • [ ] Make disabled state persisted for full era (to avoid repeated slashes)
  • [ ] Remove I am online slash
  • [ ] No chilling: Validators should get re-elected as long as they have enough stake
  • [ ] Handle validators of previous era correctly in disputes
  • [ ] Testing
    • [ ] https://github.com/paritytech/polkadot-sdk/issues/1590
    • [x] https://github.com/paritytech/polkadot-sdk/issues/2249
  • [ ] Provide tooling and instruct validators to monitor for slash events, so we can expect honest operators to react quickly to issues. (Within an era.)
  • [ ] Cleanup tasks
    • [ ] https://github.com/paritytech/polkadot-sdk/issues/1940

Possibly related paper here.

Goals for new validator disabling/Definition of Done

  1. Not affecting consensus - disabling can never become a security threat.
  2. Handling broken validators nicely (prevent continuous spam).
  3. Plays well with existing disabling in substrate
  4. Makes sure to never break.
  5. (Consequence of the above): We can enable slashing - safely and securely.

Timeline

As quickly as possible, definitely by the end of the year.

ordian avatar Aug 31 '22 14:08 ordian

Also disable validators who repeatedly vote against valid. Disabling means in general that we should not accept any votes/statements from that validator for some time, those include:

  • backing
  • approval
  • explicit dispute statements

In addition, depending on how quickly we disable a validator, it might already have raised thousands of disputes (if it disputes every single candidate for a few blocks), we should also consider deleting already existing disputes (at the dispute-coordinator) in case one side of the dispute consists only and exclusively of disabled validators - so we apply disabling to already pending participations, not just new ones.
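The pruning condition described above can be sketched as a small predicate. This is only an illustration: `DisputeVotes`, `ValidatorIndex` and `one_side_only_disabled` are hypothetical names and do not reflect the actual dispute-coordinator types.

```rust
use std::collections::HashSet;

/// Index of a validator in the session's validator set (illustrative type).
type ValidatorIndex = u32;

/// Votes recorded for one dispute, split by side (hypothetical structure).
struct DisputeVotes {
    valid: Vec<ValidatorIndex>,
    invalid: Vec<ValidatorIndex>,
}

/// True if the side has at least one vote and all of its votes come from
/// disabled validators.
fn all_disabled(side: &[ValidatorIndex], disabled: &HashSet<ValidatorIndex>) -> bool {
    !side.is_empty() && side.iter().all(|v| disabled.contains(v))
}

/// True if one side of the dispute consists exclusively of disabled
/// validators, which would make the dispute a candidate for deletion.
fn one_side_only_disabled(votes: &DisputeVotes, disabled: &HashSet<ValidatorIndex>) -> bool {
    all_disabled(&votes.valid, disabled) || all_disabled(&votes.invalid, disabled)
}
```

A dispute where even one enabled validator voted on each side is left alone by this check, so it only touches disputes that disabled validators raised on their own.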

This might be tricky to get right (sounds like it could be racy). The reason we should at least think about this a bit, is that so many disputes will delay finality for a significant amount of time resulting in DoS.

Things to consider:

  • How quickly do we disable?
  • How much does the rate limiting in dispute-distribution help?
  • Is the risk worth the complication?

eskimor avatar Sep 01 '22 07:09 eskimor

Also disable validators who repeatedly vote against valid.

That's tracked in paritytech/polkadot-sdk#785 and is purely runtime changes.

How quickly do we disable?

We can disable as soon as a dispute (reaching threshold) concludes.

This might be tricky to get right (sounds like it could be racy)

Indeed. I'd be in favor of not complicating this unnecessarily.

ordian avatar Sep 01 '22 10:09 ordian

Just had a discussion with @ordian . So what is the point of disabling in the first place? It is mostly about avoiding service degradation due to some low number of misbehaving nodes (e.g. just one). There are other mechanisms in place which provide soundness guarantees even with such misbehaving nodes, but service quality might suffer for everybody (liveness).

On the flip side, with disabling, malicious actors could take advantage of bugs/subtle issues to get honest validators slashed and thus disabled. Therefore disablement, if done wrong, could actually lead to security/soundness issues.

With these two requirements together, we can conclude that we don't need perfect disablement, but an effective rate limit for misbehaving nodes is enough to maintain service quality. Hence we should be able to limit the number of nodes being disabled at any point in time, to something like 10% maybe 20% ... in any case to something less than 1/3 of the nodes. If this threshold is reached, we can either by random choice or based on the amount of accumulated slashes (or both) enable some nodes again.

This way we do have the desired rate limiting characteristics, but at the same time make it unlikely that an attacker can get a significant advantage via targeted disabling.

Furthermore as this is about limiting the amount of service degradation a small number of nodes (willing to get slashed) can cause, it makes sense to only start disabling once a certain threshold in accumulated slashes is reached.

For the time being, we have no reason to believe that these requirements are any different for disabling in other parts of the system, like BABE. We should therefore double check that and if it holds true strive for a unified slashing/disabling system that is used everywhere through the stack in a consistent fashion.

eskimor avatar Nov 08 '22 16:11 eskimor

  1. Figure out a disabling strategy that limits the severity of honest nodes getting disabled.
  2. Keep the network functional in all cases: have enough validators enabled for grandpa to work.
  3. Expose an API to the node for retrieval of disabled validators.
  4. Don't accept statements/votes from disabled validators on node and runtime.
  5. Don't accept connections from disabled validators

eskimor avatar Feb 28 '23 17:02 eskimor

I'll leave my thoughts on a strategy for validator disabling here so that we can discuss it and improve it further (unless it's a total crap :hankey:).

When a validator gets slashed it's disabled following these rules:

  1. The validator will be disabled during the rest of the session. Or in other words - the list of disabled validators will be cleared on each session start.
  2. No more than BYZANTINE_THRESHOLD validators are disabled at the same time. Otherwise we'll break the network.
  3. Each validator will have an offense score indicating how bad his offense was. I think it's safe to use the slash amount for this score. When we reach BYZANTINE_THRESHOLD number of disabled validators, we can re-enable a small offender so that we can disable a bigger one.
  4. If we reach a point where the total offense score is BYZANTINE_THRESHOLD * SLASH FOR SERIOUS OFFENSE we can force a new era, because we have got too many offenders in the active set.
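Rules 2 and 3 can be sketched as follows. This is a minimal illustration, not a proposed implementation: `Offender`, `note_offence` and the use of the slash amount as the score are assumptions made for the example, and `byzantine_threshold` is taken here to be `(n - 1) / 3`.

```rust
/// A disabled validator together with its offence score. Using the slash
/// amount as the score is an assumption for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Offender {
    validator: u32,
    score: u32,
}

/// Largest number of faulty validators tolerated among `n`, assumed to be
/// `(n - 1) / 3`.
fn byzantine_threshold(n: usize) -> usize {
    n.saturating_sub(1) / 3
}

/// Tries to disable `new`. Returns true if `new` ends up on the disabled list.
/// When the list is full, the smallest offender is re-enabled only if `new`
/// has a strictly higher score.
fn note_offence(disabled: &mut Vec<Offender>, new: Offender, n_validators: usize) -> bool {
    let cap = byzantine_threshold(n_validators);
    if let Some(existing) = disabled.iter_mut().find(|o| o.validator == new.validator) {
        // Already disabled: keep the larger score. Whether scores should be
        // summed instead is one of the open questions.
        existing.score = existing.score.max(new.score);
        return true;
    }
    if disabled.len() < cap {
        disabled.push(new);
        return true;
    }
    // List is full: locate the smallest offender and swap it out if needed.
    let smallest = disabled
        .iter()
        .enumerate()
        .min_by_key(|(_, o)| o.score)
        .map(|(i, o)| (i, o.score));
    match smallest {
        Some((i, score)) if score < new.score => {
            disabled[i] = new;
            true
        }
        _ => false,
    }
}
```

With 7 validators the cap is 2, so a third offender only gets disabled if it is worse than one of the two already on the list. Clearing the list on session start (rule 1) would simply be `disabled.clear()`.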

Open questions:

  • [ ] Is it enough to disable a validator for a single session? We can also pick a period based on the seriousness of the offense but I prefer to start simple.
  • [ ] Is 4 from the list above an over-complication?
  • [ ] Should we keep track of the offense score of a validator? For example our disabled list is almost full. We add validator A for a small offense. Then validator B makes something more severe so we remove A and add B. Then validator A does something bad again. What should be his offense score - old score + new offense score or just new offense score? The latter makes more sense to me but it will require extra runtime storage.
  • [ ] Considering the previous point - should we disable a validator for really minor offenses? E.g. voting invalid for a valid candidate? This is related to https://github.com/paritytech/polkadot-sdk/issues/785. The alternative is just to increase its offense score and disable it if it keeps on causing problems.

tdimitrov avatar May 23 '23 15:05 tdimitrov

Reiterating Requirements:

  1. For re-enabling slashes for approval voters, we need disablement being proportional to the slash.
  2. We would like to rate limit pretty quickly to avoid validators accumulating slashes too much in case of bugs/hardware faults.
  3. We need to make sure to never disable too many validators, as this would cause consensus issues. Target should be adjustable, but 10% seems like a reasonable number.

2 is conflicting with 1, as a small slash would result in barely any rate limiting. On the flip side, if a node is misbehaving it is definitely better to have it disabled and protect the network this way, than keep slashing the node over and over again for the same flaw.

Luckily there is a solution to these conflicting requirements: Having the disabling strictly proportional to the slash is only necessary once a significant number of nodes would get disabled, hence we can introduce another (lower) threshold on number of slashed nodes, if it is below that threshold we just disable all of them, regardless of the amount.

Meaning of Disabling

Disabled nodes will always be determined in the runtime, so we do have consensus. There should be an API for the node to retrieve the list of currently disabled nodes as per a given block. The effect will be that no data from a validator disabled in a block X, should ever end up in block X+1. For simplicity and performance we will ignore things like relay parents of candidates, all that is relevant is the block being built. On the node side, we do have forks, therefore we will ignore data from validators as long as a disabling block is in our view.

Runtime

  • Filter out any statements (backing, dispute, approval, ..) from disabled nodes: Disabled nodes are not able to back a candidate, nor can they raise a dispute/participate in it.
  • Filter out bitfields from disabled nodes.
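As a rough sketch of the runtime-side filtering, assuming the disabled set for the block being built is already known: the types `StatementKind` and `SignedStatement` and the function `filter_disabled` are made up for this example and are not the runtime's actual types.

```rust
use std::collections::BTreeSet;

/// Kinds of validator-signed data that may end up in a block (illustrative).
#[derive(Debug, Clone, PartialEq, Eq)]
enum StatementKind {
    Backing,
    Approval,
    Dispute,
    Bitfield,
}

/// A signed item attributed to a validator index (hypothetical structure).
#[derive(Debug, Clone, PartialEq, Eq)]
struct SignedStatement {
    validator: u32,
    kind: StatementKind,
}

/// Drops every item whose author is disabled in the block being built on,
/// regardless of the kind of statement.
fn filter_disabled(
    statements: Vec<SignedStatement>,
    disabled: &BTreeSet<u32>,
) -> Vec<SignedStatement> {
    statements
        .into_iter()
        .filter(|s| !disabled.contains(&s.validator))
        .collect()
}
```

The filter deliberately ignores `kind`, matching the "all or nothing" reading of disabling used in this comment.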

Node

For all nodes being disabled in at least one head in our current view:

  • Don't connect to disabled nodes, remove them from the reserved set + actively drop any connection attempts. (Maybe only do this for 100% slashed nodes: slash amount/offense score should be exposed)
  • Don't accept any statements from disabled nodes (backing, approval, disputes)
  • Honest nodes should also honor themselves being disabled and should not issue any statements/doing any validations if they have any blocks in their view, for which they are disabled. This is mostly for self protection: If they relied only on others to ignore their dispute statements, they might still get in, in a later block, where they are enabled again - causing them to get slashed again.
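The "disabled in at least one head in our current view" rule amounts to taking the union of the per-head disabled sets. A minimal sketch, assuming the node has already fetched the disabled list for each head through some runtime API (the map `disabled_at`, the type aliases and both function names are hypothetical):

```rust
use std::collections::{HashMap, HashSet};

/// Stand-ins for the real hash and index types.
type BlockHash = u64;
type ValidatorIndex = u32;

/// Collects every validator that is disabled in at least one head of the view.
fn disabled_in_view(
    view: &[BlockHash],
    disabled_at: &HashMap<BlockHash, Vec<ValidatorIndex>>,
) -> HashSet<ValidatorIndex> {
    view.iter()
        .filter_map(|head| disabled_at.get(head))
        .flatten()
        .copied()
        .collect()
}

/// An honest node stays silent while it is itself disabled in any head in view.
fn should_abstain(own_index: ValidatorIndex, disabled: &HashSet<ValidatorIndex>) -> bool {
    disabled.contains(&own_index)
}
```

Heads that have left the view no longer contribute, so a validator disabled only on an abandoned fork becomes acceptable again once that fork is dropped.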

Affected subsystems:

  • Provisioner - should filter out any data from disabled validators based on the disabled state of the block currently building upon.
  • Dispute coordinator: Should early drop statements from disabled validators to avoid participation/escalation, this includes own statements if we are disabled ourselves.
  • Backing should ignore statements from disabled validators for a given block, this is so we don't end up validating a candidate proposed by a malicious backer, wasting resources. This is less important than provisioner and dispute coordinator.
  • Approval subsystems should ignore assignments and approvals as long as a disabling block is in view. This is also less important than the provisioner and the dispute-coordinator changes.

If we wanted to go fully minimal on nodes side changes, it should be enough to honor disabled state in the dispute coordinator. Degradation in backing performance should be harmless, approval subsystems are also robust against malicious actors and filtering in the provisioner is strictly speaking redundant as the filtering will also be performed in the runtime.

Disabling Strategy

We will keep a list of validators that have been slashed, sorted by slash amount. For determining for the current block, which validators are going to be disabled we do the following:

  1. We check whether the list of currently slashed validators is less than lower threshold amount (see above), if so - all slashed validators go on the disabled list and we skip the remaining points.
  2. For each slashed validator, add it to the list of disabled validators randomly with a probability equal to their slash amount: 100% slash - always on the list, 10% slash - in 10% of the time, ..
  3. We check whether the list of disabled validators is less than 10% of all validators, if not we randomly remove nodes from the disabled list until we reached the threshold.

I would suggest to ignore slash amount in 3 for simplicity, because:

  1. The higher the slash the higher the probability to be on the list to begin with, so we are already weighing based on slash.
  2. The protocols should be robust against a few rogue validators, having nothing to lose.
  3. Having so many nodes disabled is an edge case, that should never happen and if it did it is very likely due to a bug: Therefore while 100% slashed nodes have nothing to lose, it is actually quite likely that less slashed validators don't behave any better regardless.

Rule 1 protects the network from single (or a low amount) of rogue validators and also protects those validators from themselves: Instead of getting slashed over and over again, they will end up being disabled for the whole session. Giving operators time to react and fix their nodes. (See point 2 in requirements)

This means we will have two thresholds: a lower one, below which we always disable every slashed validator, and an upper one, above which we start to randomly re-enable validators.
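The three steps of the strategy could look roughly like this. It is a sketch under several assumptions: slash amounts are whole percentages, the randomness source is passed in as a closure `draw(bound)` that returns a value in `0..bound`, and the names `Slashed` and `select_disabled` are invented for the example. Where on-chain randomness would come from is not addressed.

```rust
/// A slashed validator with its slash amount in percent (0..=100).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Slashed {
    validator: u32,
    slash_percent: u32,
}

/// Picks the validators to disable for the current block.
/// `draw(bound)` is assumed to return a uniformly random value in `0..bound`.
fn select_disabled(
    slashed: &[Slashed],
    n_validators: usize,
    lower_threshold: usize,
    upper_percent: usize,
    mut draw: impl FnMut(u32) -> u32,
) -> Vec<u32> {
    // Step 1: below the lower threshold everybody who was slashed is disabled.
    if slashed.len() < lower_threshold {
        return slashed.iter().map(|s| s.validator).collect();
    }
    // Step 2: include each validator with probability equal to its slash.
    let mut disabled: Vec<u32> = slashed
        .iter()
        .filter(|s| draw(100) < s.slash_percent)
        .map(|s| s.validator)
        .collect();
    // Step 3: randomly re-enable validators until we are within the upper cap.
    let cap = n_validators * upper_percent / 100;
    while disabled.len() > cap {
        let len = disabled.len();
        let victim = draw(len as u32) as usize % len;
        disabled.swap_remove(victim);
    }
    disabled
}
```

Step 3 ignores the slash amount, as suggested above. Swapping in a weighted removal would only change the body of the `while` loop.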

Disabling, eras, sessions, epochs

Information about slashes should be preserved until a new validator set is elected. With a newly elected validator set, we can drop information about slashed validators and start anew with no validators disabled.

If we settle on this approach, then this would be obsoleted by the proposed threshold system.

eskimor avatar May 25 '23 10:05 eskimor

Two questions/comments:

For all nodes being disabled in at least one head in our current view:

Why head in current view instead of 'slashed in finalized block'? To be proactive in case of finality stall?

And the second related to disabling strategy:

  3. We check whether the list of disabled validators is less than 10% of all validators, if not we randomly remove nodes from the disabled list until we reached the threshold.

I think we should do this in two steps:

  1. Randomly remove nodes which are not big offenders (100% slash).
  2. If all the nodes in the list are big offenders - start removing them randomly too.
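A sketch of this two-step removal, with the same caveats as any example here: `Slashed`, `trim_to_cap` and the `draw(bound)` randomness closure (returning a value in `0..bound`) are assumptions made for illustration.

```rust
/// A disabled validator with its slash amount in percent (0..=100).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Slashed {
    validator: u32,
    slash_percent: u32,
}

/// Re-enables validators until at most `cap` remain disabled. Validators
/// slashed less than 100% are removed first; 100% slashed ones are touched
/// only when nobody else is left.
fn trim_to_cap(disabled: &mut Vec<Slashed>, cap: usize, mut draw: impl FnMut(u32) -> u32) {
    while disabled.len() > cap {
        // Positions of the validators that are not big offenders.
        let minor: Vec<usize> = disabled
            .iter()
            .enumerate()
            .filter(|(_, s)| s.slash_percent < 100)
            .map(|(i, _)| i)
            .collect();
        let idx = if minor.is_empty() {
            draw(disabled.len() as u32) as usize % disabled.len()
        } else {
            minor[draw(minor.len() as u32) as usize % minor.len()]
        };
        disabled.swap_remove(idx);
    }
}
```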

tdimitrov avatar May 25 '23 12:05 tdimitrov

I think we should do this in two steps:

  1. Randomly remove nodes which are not big offenders (100% slash).
  2. If all the nodes in the list are big offenders - start removing them randomly too.

Yes, we could do that, but I argued above that we should be able to keep it simple without any harm done.

Why head in current view instead of 'slashed in finalized block'? To be proactive in case of finality stall?

Yes. Given that attacks on disputes can trigger a finality stall, it would be really bad if attackers could avoid getting disabled by their very attack. While at the same time for honest, but malfunctioning nodes they might already accumulate a significant amount of slash before getting disabled.

eskimor avatar May 25 '23 12:05 eskimor

4. If we reach a point where the total offense score is BYZANTINE_THRESHOLD * SLASH FOR SERIOUS OFFENSE we can force a new era, because we have got too many offenders in the active set.

What are the other repercussions of forcing a new era? This sounds like a good idea, but I'm guessing it could break a lot of unrelated things. We should consider tooling as well.

[ ] Should we keep track of the offense score of a validator? For example our disabled list is almost full. We add validator A for a small offense. Then validator B makes something more severe so we remove A and add B. Then validator A does something bad again. What should be his offense score - old score + new offense score or just new offense score? The latter makes more sense to me but it will require extra runtime storage.

I think we can just use the slashes as @eskimor suggested. But, yes, if a validator is disabled then reactivated then slashed again we need to recalculate the disabled list.

I am still a little uncomfortable with the notion of disabling validators who haven't been 100% slashed in order to protect them from bugs when they can always ask to have the slashes reversed by governance. My bias is towards handling it economically and increasing the slashing amount if we think repeated misbehavior would bring too much load on the network before a bad actor loses all their stake. However, this probably isn't compatible with the solution we came up with for time overruns (since we have to balance the overrun charge with the collective amount slashed from potentially as much as a byzantine threshold of approval checkers). I'll probably just have to accept this.

Sophia-Gold avatar May 25 '23 19:05 Sophia-Gold

What are the other repercussions of forcing a new era? This sounds like a good idea, but I'm guessing it could break a lot of unrelated things. We should consider tooling as well.

We discussed it yesterday. It's not a good idea. Starting a new era takes time and it's not safe to force it if we have got too many misbehaving validators. We won't do this.

tdimitrov avatar May 26 '23 07:05 tdimitrov

About the rate limiting, considering that we have that upper limit on disabled nodes: I think having a rate limiting disabling strategy for lesser slashes makes sense and adds little to no complexity. It only makes sense with accumulating slashes though, or alternatively if we considered the slashes as cumulative at least from the disabling strategy's perspective. Consider nodes that are not behaving equally badly, some being more annoying than others: we would disable them more and more until they are eventually silenced, letting the network resume normal operation. Other nodes, having only minor occasional hiccups or even only one, would continue operating normally.

This also has the nice property that the growth of the disabling ratio for an individual node will automatically slow down, as there are fewer possibilities for the node to commit any offenses. So to get disabled 100%, you really have to be particularly annoying.

About accumulating slashes:

We would like to protect the network from a low number of nodes going rogue, but once disputes are raised by more than just a couple of nodes it is not an isolated issue, but either an attack or more likely a network wide issue.

In case of an attack, it would then be good to have accumulating slashes, in case of a network wide issue - accumulating slashes would still be no real harm, if we can easily refund them - can we?

For isolated issues, nodes are protected from excessive slashing via disabling.

eskimor avatar May 26 '23 15:05 eskimor

A priori, we should avoid randomness here since on-chain randomness is biasable. It makes analyzing this annoying and appears non-essential. I've not thought much about it though, so if it's easy then explain.

We can disable the most slashed nodes of course, which also remains biasable, but not for quite so long in theory.

Ideally, we should redo the slashing for the whole system, aka removing slashing spans ala https://github.com/w3f/research/blob/master/docs/Polkadot/security/slashing/npos.md, but that's a larger undertaking. We'd likely plan for subsystem elsewhere bugs too, which inherently links this to the subsystem.

burdges avatar May 29 '23 22:05 burdges

I am still a little uncomfortable with the notion of disabling validators who haven't been 100% slashed in order to protect them from bugs when they can always ask to have the slashes reversed by governance.

We want slashes to be minimal while still accomplishing their protocol goals. It avoids bad press, community drama, etc.

We do not know exactly what governance considers bugs, like what if the validator violates some obscure node spec rule. It's maybe even political, like based upon who requests a refund, who their ISP is, etc. In fact, there exist stakers like parity and w3f who'd feel reluctant to request refunds for some borderline bugs.

burdges avatar May 29 '23 22:05 burdges

We will keep a list of validators that have been slashed, sorted by slash amount.

We are disabling only slashed validators? We won't disable anyone disputing a valid block or voting for invalid block (unless being a backer)?

tdimitrov avatar Jun 01 '23 12:06 tdimitrov

Yes we only ever disable slashed validators. We do disable on disputing valid block though and we will also slash and disable for approving an invalid block, see paritytech/polkadot-sdk#635 .. but a suitable disabling strategy as discussed here is a prerequisite for the latter.

eskimor avatar Jun 02 '23 15:06 eskimor

And one more question regarding:

  2. For each slashed validator, add it to the list of disabled validators randomly with a probability equal to their slash amount: 100% slash - always on the list, 10% slash - in 10% of the time, ..

If there is space for all 100% slash and all 10% slash (in this case) - should we (a) add all 10% slashed validators to the set or (b) still add them with 10% probability (and potentially skip some validators)?

I think you meant (a) otherwise there is contradiction with:

  1. We check whether the list of currently slashed validators is less than lower threshold amount (see above), if so - all slashed validators go on the disabled list and we skip the remaining points.

tdimitrov avatar Jun 05 '23 11:06 tdimitrov

No it is (b) - point 1 was under the prerequisite that we are below the lower threshold. For point 2 and on-wards this is not the case. Idea being: If there are only a few rogue validators having problems - just disable them and don't bother. It is not a security threat and keeping them silent is better for everybody.

eskimor avatar Jun 05 '23 19:06 eskimor

Yes, my bad. There is no contradiction. If we are at point 2, we are already above the limit.

tdimitrov avatar Jun 05 '23 20:06 tdimitrov

I like thinking of this as rate limiting instead of disabling. Something at least like (1/2)^percentage_slash so that a validator slashed 1% is only active every other block, 2% every 4th block. Probably steeper than this.
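One deterministic reading of the `(1/2)^percentage_slash` idea is to let a validator act only in blocks whose number is a multiple of `2^percentage_slash`. This is just a sketch of that reading. The function name `is_active` and the choice to key on the block number are assumptions.

```rust
/// Returns true if a validator slashed `slash_percent` percent may act in
/// `block_number`, giving it an activity rate of (1/2)^slash_percent.
fn is_active(block_number: u64, slash_percent: u32) -> bool {
    match 1u64.checked_shl(slash_percent) {
        // Active once every 2^slash_percent blocks.
        Some(period) => block_number % period == 0,
        // 2^slash_percent no longer fits into u64: treat as never active.
        None => false,
    }
}
```

A 1% slash gives a period of 2 blocks and a 2% slash a period of 4. Note that all validators with the same slash are active in exactly the same blocks here, so this sketch does nothing to stagger them.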

And then if we reach a concerning threshold of active validators, even just on average, we can slow the rate limiting. A special case is when it's so bad we need to reactivate validators that have been slashed 100%: they still shouldn't be allowed to back candidates and maybe not produce relay chain blocks either. We could generally have the slower rate limiting apply only to finality and not backing and block production.

The upside of this is it doesn't require randomness. However, the problem is we'd need to think about whether nodes are synced up in how they're rate limited. For example, if you have 10% of the network 50% rate limited that would be fine if the rate limiting is staggered, which is less likely in practice if we don't intentionally design it that way.

Sophia-Gold avatar Jun 05 '23 20:06 Sophia-Gold

A special case is when it's so bad we need to reactivate validators that have been slashed 100%: they still shouldn't be allowed to back candidates and maybe not produce relay chain blocks either.

I think we can't do this. If we disable more than f validators - we'll break the security assumptions of the protocols. Not allowing them to back candidates is more or less equal to disabling them.

The upside of this is it doesn't require randomness.

Can you elaborate on this? How will we pass by without randomness?

tdimitrov avatar Jun 05 '23 20:06 tdimitrov

I think we can't do this. If we disable more than f validators - we'll break the security assumptions of the protocols. Not allowing them to back candidates is more or less equal to disabling them.

This would just be choosing safety over liveness, no?

Can you elaborate on this? How will we pass by without randomness?

We can rate limit deterministically, like in my example. Regardless of whether we do it deterministically or try to do it randomly, we do still probably need to assume all rate limited validators are sometimes inactive in the same slot -- or likely because maybe they were slashed for the same reason. Unless we try to intentionally stagger them and do some complicated bookkeeping around it. So if we want no more than 10% inactive then we'd probably have to back down on rate limiting when 10% are rate limited at all. Maybe that's not a problem.

Sophia-Gold avatar Jun 05 '23 20:06 Sophia-Gold

This would just be choosing safety over liveness, no?

We can put it this way. My main concern was that we were trying to handle a case when there are more than f byzantine nodes but this is not entirely correct. f is related to all validators, not just the ones in the active set right?

My concern with disabling too many validators is killing the network in case of a bug which is not an attack. If we sacrifice liveness aren't we killing any chances of governance to recover the network?

Something at least like (1/2)^percentage_slash so that a validator slashed 1% is only active every other block, 2% every 4th block. Probably steeper than this.

Yes I understand your idea for the disabling now. Thanks!

tdimitrov avatar Jun 06 '23 08:06 tdimitrov

Sophia and I discussed the possibility of letting disabled validators still participate in finality but not back any candidates and maybe also let them not produce blocks (if too many validators get disabled). The problem is, both complicate things:

  1. We also cannot block block production endlessly: if too many validators are blocked, malicious nodes could take over and, for example, influence randomness more than they should.
  2. Backing sounds less harmful at first, but we are planning to move more and more relay chain functionality to parachains, so this can become a security problem as well.

To keep it simple, I would suggest to stick to the boolean kind of disabling we have right now. You can either be disabled at a given point in time or not, there are no other kinds of disabling, like disabled but still allowed to vote on finality and such.

eskimor avatar Jun 12 '23 15:06 eskimor

2. Backing sounds less harmful at first, but we are planning to move more and more relay chain functionality to parachains, so this can become a security problem as well.

Ah, right. We didn't discuss relay chain block production and I just added it in my comment here. Eventually everything we care about continuing on the relay chain will be on a system parachain so backing would be a problem as well.

Sophia-Gold avatar Jun 12 '23 17:06 Sophia-Gold

@kianenigma can you share your feedback on this?

Some context - we want to adapt the validator disabling strategy and expand it to parachain consensus. More or less it's specified in this comment. Do you see any problems with this? Will it play nicely with the rest of the slashing/consensus/etc code in substrate?

More specifically - are you comfortable with the new disabling strategy and making it the default (and only) one in substrate's staking pallet?

tdimitrov avatar Jun 19 '23 13:06 tdimitrov

Luckily there is a solution to these conflicting requirements: Having the disabling strictly proportional to the slash is only necessary once a significant number of nodes would get disabled, hence we can introduce another (lower) threshold on number of slashed nodes, if it is below that threshold we just disable all of them, regardless of the amount.

If I read this correctly, this is easily doable via something like:

pub enum DisableStrategy {
	/// Independently of slashing, this offence will not disable the offender.
	Never,
	/// Only disable the offender if it is also slashed.
	WhenSlashed,
	/// Independently of slashing, this offence will always disable the offender.
	Always,
	/// Disable the offender if more than given percent of the set has already been disabled.
	AfterRatio(Perbill),
}

The changes in slashing.rs would need to check the number of already disabled validators in this era, and pass true or false based on this to add_offending_validator.

I don't know off the top of my head how to get this ratio and see if we are below or above it, but in theory it should be possible. It might need a query to the session pallet (to which we already have an interface via SessionInterface) to get the number of total validators in the current session, and the ones that have been disabled.

We use the same logic to determine if we should trigger a new era or not, so it should be fine.
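The ratio check itself is simple integer arithmetic. The sketch below is self-contained and therefore uses a small stand-in `Ratio` type instead of the real `Perbill`; `disabled_ratio_exceeds` is a hypothetical helper, and how the two counts are obtained from the session pallet is left open.

```rust
/// Parts-per-billion ratio, a simplified stand-in for `Perbill`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Ratio(u32);

impl Ratio {
    const BILLION: u64 = 1_000_000_000;

    /// Builds a ratio from a whole percentage, saturating at 100%.
    fn from_percent(percent: u32) -> Self {
        Ratio(percent.min(100) * 10_000_000)
    }
}

/// Returns true if strictly more than `ratio` of the validator set is already
/// disabled. Compares disabled / total with ratio / 1e9 without division.
fn disabled_ratio_exceeds(disabled: u32, total: u32, ratio: Ratio) -> bool {
    if total == 0 {
        return false;
    }
    (disabled as u64) * Ratio::BILLION > (ratio.0 as u64) * (total as u64)
}
```

The result of this check is what `slashing.rs` would turn into the `true` or `false` passed to `add_offending_validator`, in whichever direction the chosen strategy requires.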

A question I have is if disablement is "all or nothing" type of situation, or not? A validator in my mental model, as it stands now, has two main roles:

  1. Block authoring
  2. Parachain consensus

Current interface and implementation does not distinguish between the two and by disabling a validator, deprives them of both duties.

kianenigma avatar Jun 22 '23 13:06 kianenigma

We'll need more nuanced disabling here because we cannot really limit the number disabled. We should discuss a hybrid scheme:

If a backer's vote comes back invalid then

  • slash them 100%, and
  • disable permanently from approvals and backing, but allow governance to re-enable.
  • Do not disable them from grandpa.
  • It's unclear if we disable them from block production, maybe your 10% helps here, but a priori we should simply choose disablement or non-disablement here.

If an approval vote comes back invalid then

  • Always disable them from backing.
  • only count their yes votes as 1/2 in approvals, but their disputes remain full disputes.
  • Do not disable them from grandpa or really even relay chain block production.
  • If at least 3% (?) of validators have invalid approval votes on a parachain, then disable that parachain.

If a dispute comes back incorrect then

  • Disable them from backing, but maybe milder options work too.
  • Do not limit their approvals or disputes votes.
  • Do not disable them from grandpa or relay chain block production.
  • If at least 3% (?) of validators have incorrect dispute votes on a parachain, then disable that parachain.
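To make the per-validator part of this hybrid scheme easier to compare, it can be written down as a lookup from offence to effects. The sketch is purely illustrative: `Offence`, `Effects` and `effects_of` are invented names, `None` marks points the list above leaves open or does not mention, and the 3% rule for disabling whole parachains is not covered.

```rust
/// The three offence kinds distinguished by the hybrid scheme.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Offence {
    InvalidBacking,
    InvalidApproval,
    IncorrectDispute,
}

/// Per-role consequences for the offending validator.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Effects {
    /// The validator may no longer back candidates.
    backing_disabled: bool,
    /// Weight of a "yes" approval vote in halves: 0 = disabled, 1 = half, 2 = full.
    approval_weight_halves: u8,
    /// Whether dispute votes are limited; `None` where the list does not say.
    disputes_limited: Option<bool>,
    /// GRANDPA participation is never removed in this scheme.
    grandpa_disabled: bool,
    /// Block production; `None` where the list calls it unclear.
    block_production_disabled: Option<bool>,
}

/// Maps an offence to its effects as listed in the hybrid scheme.
fn effects_of(offence: Offence) -> Effects {
    match offence {
        Offence::InvalidBacking => Effects {
            backing_disabled: true,
            approval_weight_halves: 0,
            disputes_limited: None,
            grandpa_disabled: false,
            block_production_disabled: None,
        },
        Offence::InvalidApproval => Effects {
            backing_disabled: true,
            approval_weight_halves: 1,
            disputes_limited: Some(false),
            grandpa_disabled: false,
            block_production_disabled: Some(false),
        },
        Offence::IncorrectDispute => Effects {
            backing_disabled: true,
            approval_weight_halves: 2,
            disputes_limited: Some(false),
            grandpa_disabled: false,
            block_production_disabled: Some(false),
        },
    }
}
```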

We've this insanely fancy slashing system that limits the total slashing, even under bugs which governance never refunds. We might nerf it though, and it's very sensitive to parameter choices, so yes we might screw up by leaning only upon it to limit damage from repeated slashing.

I still owe @kianenigma some more serious reevaluation of this, but it's worth discussing what approvals looks like, assuming the slashing system can continue to have damage limits itself. If this avoids us removing people then we can have a simpler analysis of protocols like grandpa and approvals, and we can avoid the case where an adversary exploits a bug to disable many people in particular.

We've no similar analysis issues with disabling whole parachains, but of course if you disable a critical system parachain then that's another problem. We could've selected parachains immune to disabling, and then just be extra careful with their code upgrades, or even make their code live in the relay chain code and/or transition them to new wasm engines slower.

burdges avatar Jun 23 '23 12:06 burdges

I think disabling parachains should be an orthogonal topic. Getting this right might be even more nuanced than a good validator disabling strategy. @burdges your proposal sounds pretty complex and advantages are not really clear to me. I would really like to keep this as simple as possible, while maintaining reasonable security/liveness. We can for example adjust above proposal, so we will never re-enable a 100% slashed validator, if there are any other to re-enable to reach the threshold (10%). But then this really should be good enough.

eskimor avatar Jul 04 '23 16:07 eskimor

In any scenario where we end up with rate limiting instead of full disablement, we should have rather large strides. They should be larger than DISPUTE_CANDIDATE_LIFETIME_AFTER_FINALIZATION, maybe twice the size. So if a validator is disabled 50% of the time then it would be disabled for 20 blocks in a row and then re-enabled for 20 blocks in a row, instead of flip-flopping each block. This way, by the time the validator is enabled again its dispute votes in that time frame would likely be already obsolete and we minimize the harm done.
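Stride-based rate limiting can be sketched by grouping blocks into strides and disabling the validator for the first few strides of every cycle. The function name and its parameters are assumptions for illustration, and the stride length is left as a parameter rather than tied to the actual value of DISPUTE_CANDIDATE_LIFETIME_AFTER_FINALIZATION.

```rust
/// Returns true if the validator is disabled in `block_number`.
/// Blocks are grouped into strides of `stride` blocks. Within every cycle of
/// `cycle_strides` strides, the first `disabled_strides` strides are disabled.
fn is_disabled_at(
    block_number: u64,
    stride: u64,
    disabled_strides: u64,
    cycle_strides: u64,
) -> bool {
    if stride == 0 || cycle_strides == 0 {
        // Degenerate configuration: treat as not rate limited.
        return false;
    }
    let stride_index = block_number / stride;
    stride_index % cycle_strides < disabled_strides
}
```

For the 50% example, `is_disabled_at(block, 20, 1, 2)` disables blocks 0 to 19, enables blocks 20 to 39, and then repeats.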

eskimor avatar Jul 05 '23 08:07 eskimor

Going back to some of the original requirements:

  1. For https://github.com/paritytech/polkadot-sdk/issues/635, we need disablement being proportional to the slash.

I don't believe proportional disabling is needed at all as explained in here.

2 is conflicting with 1, as a small slash would result in barely any rate limiting.

If 1 doesn't require proportional disabling then there is no longer any conflict between them and we can simply disable fully even for minor slashes in spirit of...

if a node is misbehaving it is definitely better to have it disabled and protect the network this way, than keep slashing the node over and over again for the same flaw.

Which is essentially aiming to protect honest nodes with minimal slashing by fully disabling it.

We recently had a chat about it but I'd point it out here on record. Disabling on minor slashes and accumulating slashes should both provide enough security as a deterrent against repeating offences, but disabling for minor offences is more lenient for honest faulty nodes and that's why we prefer it. Ideally we'd have both disabling AND accumulating as attackers can still commit multiple minor offences (for instance invalid on valid disputes) in the same block before they get punished and disabled, but damages done should be minimal so it's not a huge priority.

Overkillus avatar Sep 04 '23 15:09 Overkillus