CIP 32 - Attestation Node Incentives - Discussions

Open codyborn opened this issue 4 years ago • 39 comments

CIP 32 https://github.com/celo-org/celo-proposals/pull/121 proposes introducing Attestation Service operation incentives via slashing. This is the first step toward a more discerning incentive system. As such, it's focused on a min uptime bar (one attestation over 11 days) and will not increase validator rewards. Future proposals will likely support more granular uptime measures and the potential for operators to earn additional reward.

codyborn avatar Jan 12 '21 00:01 codyborn

A little bit devil's advocate here, but I'm not sure I agree with implementing slashing for downtime on a service ultimately unrelated to block proposals and consensus of the protocol. This is not an incentive, it's a disincentive.

I definitely prefer a reward model to a punishment model. Once there is a positive economic incentive, won't validators have a real desire to keep their service online (especially as the number of new users increases)?

aaronmboyd avatar Jan 13 '21 15:01 aaronmboyd

Thanks for starting the discussion @aaronmboyd. I think this is a matter of perspective. One could view the existing validator reward to cover all services that a validator provides, including running an attestation service. In this case, by default the protocol assumes the operator is running the attestation service properly, and will only withhold some of the reward (via slashing) if it finds out later that this is not the case. To make it more explicit, we could cap the slashable amount to some limit per epoch OR instead of slashing from the LockedGold, we could reduce the block rewards by some capped amount. WDYT?

I went with this perspective since a majority of validators are running Attestation Services (96/100) and a majority are doing the job well. Making this the default case made sense to me, but I'm definitely open to other thoughts. I also believe that we should increase the block reward to those that run the Attestation Service exceptionally well. Since measuring this on-chain introduces more complexity, it was reserved for a future proposal.

codyborn avatar Jan 13 '21 20:01 codyborn

Just read through the proposal and have some comments

Attack on proposed slashing method

A validator that has a downed attestation service, or simply does not want to run one, could do the following in every slashing period.

  1. Make 100 attestation requests (more generally, as many requests as there are validators with registered attestation keys), which guarantees that exactly one request will be assigned to themselves when selecting issuers.
  2. Fulfill the request by generating the required signature and posting it to the chain.

Note that the validator does not need to have possession of the phone number associated with the identifier they request against, nor does the identifier need to correspond to a known phone number and pepper.

This attack can prevent slashing for ~$5 spent every slashing period, about $15/month under the proposal.
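
The cost estimate above can be reproduced with a back-of-the-envelope calculation. This is only a sketch: the per-request fee of $0.05 is an assumption inferred from the "~$5 for 100 requests" figure, the function name is hypothetical, and the month is rounded up to three 11-day slashing periods.

```python
import math

SLASHING_PERIOD_DAYS = 11    # from the proposal: one attestation per 11 days
FEE_PER_REQUEST_USD = 0.05   # assumption: ~$5 / 100 requests


def attack_cost_per_period(num_issuers: int, fee_per_request_usd: float) -> float:
    """Cost of making one request per registered issuer, so that one of the
    requests lands on the attacker's own attestation key."""
    return num_issuers * fee_per_request_usd


per_period = attack_cost_per_period(100, FEE_PER_REQUEST_USD)
# Number of slashing periods that touch a 30-day month, rounded up.
periods_per_month = math.ceil(30 / SLASHING_PERIOD_DAYS)
per_month = per_period * periods_per_month

print(f"~${per_period:.2f} per period, ~${per_month:.2f} per month")
```

The point of the arithmetic is that the cost scales with the number of registered issuers and the request fee, not with the size of the slashing penalty being avoided.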

Alternative to MinStakeSlasher

MinStakeSlasher seems like an extra step that should not really be necessary. Is there a reason we can't amend the core contracts (i.e. LockedGold.sol and Validators.sol) to mark a group as ineligible, or otherwise reduce its number of electable validators, if the stake drops below the requirement as part of the slashing operation?

More generally, the MinStakeSlasher as proposed makes a validator slashable if their stake drops below the highest slashing penalty and reward (i.e. double signing). This still permits validators to operate with less than the normally required stake, making them less vulnerable to repeated slashing (e.g. if they participate in creating a fork of length greater than 1) and reducing the overall deterrence. Instead of allowing validators to remain elected with a stake less than the requirement, but greater than the single greatest slashing penalty, I would suggest we reduce the number of validators they can elect if they drop below the requirement, and encourage validators to maintain a stake in excess of the requirement as a buffer.

Profitability of running an attestation service

Under the current scheme, running an attestation service is not directly profitable. Validators are paid by the protocol around $75k per year, and it seems reasonable to state that this can be the incentive for running an attestation service, as it is a required part of being a validator under this proposal. If that's the reasoning, I think we should make it explicit.

One gap of slashing as the (dis)incentive is that it does not pave the way to independent attestation providers, and instead entrenches the position that only validators provide attestations. Enabling independent attestation providers has been discussed before, but I am not aware of any conclusion. It may be worth revisiting it now to decide whether we want to keep that option open or not.

Another note here is that a slashing penalty of 40 CELO per slashing period, about 120 CELO per month, may still be less than the operational costs of setting up a node and any extra hours needed to keep it online. It's not crazy to imagine some validators deciding to lock up an extra 1360 CELO for a year's worth of slashing instead of actually setting up an attestation service.
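
The 120 CELO and 1360 CELO figures are consistent with rounding up the number of 11-day slashing periods. A quick check, assuming a 30-day month and a 365-day year (the rounding rule is an assumption, not something stated in the comment):

```python
import math

PENALTY_CELO = 40           # penalty per slashing period, from the comment above
SLASHING_PERIOD_DAYS = 11   # minimum-uptime window in the proposal

# Round up: a partially elapsed period still incurs a full penalty.
periods_per_month = math.ceil(30 / SLASHING_PERIOD_DAYS)
periods_per_year = math.ceil(365 / SLASHING_PERIOD_DAYS)

monthly_penalty = PENALTY_CELO * periods_per_month
yearly_penalty = PENALTY_CELO * periods_per_year

print(monthly_penalty)  # 120
print(yearly_penalty)   # 1360
```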

nategraf avatar Jan 14 '21 01:01 nategraf

Thanks for the feedback Victor!

Attack on proposed slashing method

I agree. It also doesn't solve for the case where an Attestation Service is partially available but still able to meet the min bar. This is the first step toward more discerning incentives. If we do observe this behavior, we can propose a one-off slashing via a governance proposal. I'd like to see a future attestation incentive CIP take into account:

  • Attestation success rate (especially in sessions where other issuers' attestations have been solved). This is less likely to be a user-related failure if the user has entered codes from other issuers.
  • Users who have passed a reCAPTCHA and DeviceID check. This makes it much harder to automate a script like the one you mentioned above. To achieve this, we would rely upon the Komenci flow to provide an on-chain attestation after performing this verification.

Alternative to MinStakeSlasher

Good point. We can create a util method similar to SlasherUtil.performSlashing() that slashes and will optionally forceDeaffiliateIfValidator if the stake drops below a min amount. With this proposal, it can also be expected that a validator will want to add a buffer to their lockedGold to avoid being deaffiliated.

Profitability of running an attestation service

I think most validators run an attestation service currently because they want to support the network and it helps differentiation when looking for votes. We see this with 96/100 validators currently running an Attestation Service with no explicit incentive. With slashing comes financial and reputation loss, which I think will make it fair for everyone since many validators have already put effort into running the Attestation Service well.

codyborn avatar Jan 14 '21 03:01 codyborn

Have you thought about a setup that just takes out misbehaving attestation nodes from the pool of nodes that can be selected as an issuer instead?

So instead of a slashing penalty, if a node is suspected to be failing attestations, it becomes blacklisted and blacklisting time can increase exponentially for subsequent failures.

Pros for this approach would be that we can be much more aggressive in determining what it means that node is failing or flaking attestations, because penalty isn't huge. And validators that aren't running healthy attestation nodes will just become blacklisted completely over time. (imo it is fine if only ~50-80% of validators run attestation nodes as long as they are healthy, instead of forcing everyone to run an attestation node and have them be more flaky).
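
The escalating blacklist described above could be sketched as follows. This is an illustration only, not part of any Celo contract: the names, the base duration of one epoch, the doubling factor, and the reset on success are all assumptions.

```python
from dataclasses import dataclass

BASE_BLACKLIST_EPOCHS = 1  # hypothetical base duration


@dataclass
class IssuerStatus:
    consecutive_failures: int = 0
    blacklisted_until_epoch: int = 0  # issuer is selectable from this epoch on


def record_failure(status: IssuerStatus, current_epoch: int) -> int:
    """Blacklist the issuer; the duration doubles with each subsequent failure."""
    duration = BASE_BLACKLIST_EPOCHS * 2 ** status.consecutive_failures
    status.consecutive_failures += 1
    status.blacklisted_until_epoch = current_epoch + duration
    return duration


def record_success(status: IssuerStatus) -> None:
    """Assumption: a healthy attestation resets the escalation."""
    status.consecutive_failures = 0


def is_selectable(status: IssuerStatus, current_epoch: int) -> bool:
    """Only issuers whose blacklist has expired can be picked as issuers."""
    return current_epoch >= status.blacklisted_until_epoch
```

With these assumptions, an issuer that keeps failing is excluded for 1, 2, 4, 8, ... epochs, so a node that is never fixed ends up excluded almost permanently while an occasional failure costs very little.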

zviadm avatar Jan 14 '21 15:01 zviadm

That's a really good idea to protect the user experience @zviadm. I think this mechanism can work in conjunction with other incentive mechanisms. Without the incentive of slashing or additional rewards, I worry that this alone won't encourage validators to run an Attestation Service. Today there is the foundation voting for AS operators as an incentive, but because some validators are not eligible, it won't apply to everyone and may not be a strong enough incentive. WDYT?

codyborn avatar Jan 14 '21 17:01 codyborn

That's a really good idea to protect the user experience @zviadm. I think this mechanism can work in conjunction with other incentive mechanisms. Without the incentive of slashing or additional rewards, I worry that this alone won't encourage validators to run an Attestation Service. Today there is the foundation voting for AS operators as an incentive, but because some validators are not eligible, it won't apply to everyone and may not be a strong enough incentive. WDYT?

Is it a big issue if in this early stage only Foundation-voted validators are strongly incentivized to run an attestation node? I.e. if others don't want to run attestation nodes, is it really a problem if they just stop running them?

in future, if attestation payouts themselves still aren't enough to incentivize running the node, I would be more in support to add additional pay mechanisms (e.g. diverting some of the cUSD rewards to non blacklisted attestation nodes, as additional validator rewards) instead of forcing people to run an attestation node as part of being a validator with penalties.

When people self-select to run an attestation node, it is easier to enforce things like everyone having both Twilio and Nexmo accounts (or whatever new messaging platform gets added), having them set up with proper phone numbers, and all the other extra work. It is a very different kind of setup work from completing consensus, so that is why I think it will be healthier if we promote running a node through rewards instead of penalties.

zviadm avatar Jan 14 '21 17:01 zviadm

Another example of positive enforcement: If we have separate rewards for attestations, later on, we can also adjust rewards based on successful attestations for each node. So now there is another incentive to make sure node runners are completing all the attestations, not just passing the minimum bar.

zviadm avatar Jan 14 '21 17:01 zviadm

@zviadm I agree that it'd be nice to untangle the incentives of running a validator and attestation service. This would also support non-validator attestation operators in the future. One downside is that it will require a hard-fork which will naturally push the timeline out a bit. Since today, the validator rewards are intended to cover operation of the attestation service, do you see any issues with reducing the validator reward and using this difference for rewarding healthy attestation service operation? For example, if validator rewards today are approximately $75k/year, we could reduce them to $60k and have $15k for attestation service incentives?

codyborn avatar Jan 22 '21 03:01 codyborn

Have you thought about a setup that just takes out misbehaving attestation nodes from the pool of nodes that can be selected as an issuer instead?

So instead of a slashing penalty, if a node is suspected to be failing attestations, it becomes blacklisted and blacklisting time can increase exponentially for subsequent failures.

Pros for this approach would be that we can be much more aggressive in determining what it means that node is failing or flaking attestations, because penalty isn't huge. And validators that aren't running healthy attestation nodes will just become blacklisted completely over time. (imo it is fine if only ~50-80% of validators run attestation nodes as long as they are healthy, instead of forcing everyone to run an attestation node and have them be more flaky).

I think this is a great alternative. Especially if we additionally rewarded each completed attestation somehow, the economic incentives would be in place.

aaronmboyd avatar Jan 22 '21 09:01 aaronmboyd

@zviadm I agree that it'd be nice to untangle the incentives of running a validator and attestation service. This would also support non-validator attestation operators in the future. One downside is that it will require a hard-fork which will naturally push the timeline out a bit. Since today, the validator rewards are intended to cover operation of the attestation service, do you see any issues with reducing the validator reward and using this difference for rewarding healthy attestation service operation? For example, if validator rewards today are approximately $75k/year, we could reduce them to $60k and have $15k for attestation service incentives?

Splitting out portion of validator rewards to go to attestation service providers makes sense to me. There are probably many ways to setup the rewards for attestation service providers and most of them will probably work ok. Here is one way that I have been thinking about:

  • Distribute rewards every epoch. Decide on total rewards for all attestation service providers per epoch. (e.g. if total attestation rewards per year are ~15k * 100, we would distribute 4.2k cUSD every epoch)
  • To calculate distribution for each attestation node:
    • choose rolling window duration (e.g. 30 epochs)
    • rewardsForNode = "successful attestations for this node during rolling window" / "total successful attestations during rolling window"
    • assuming attestations are more or less randomly distributed, over a reasonable rolling window duration (like 30 epochs) it should average out reasonably well across nodes.
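
A minimal sketch of the rolling-window distribution described above. It assumes the per-node ratio is multiplied by the per-epoch reward pool (the bullet only gives the ratio), and the class and method names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict


class RollingRewards:
    """Splits a fixed per-epoch pool by each node's share of successful
    attestations over the last `window_epochs` epochs."""

    def __init__(self, window_epochs: int, reward_per_epoch: float) -> None:
        # deque(maxlen=...) drops the oldest epoch automatically.
        self.window: Deque[Dict[str, int]] = deque(maxlen=window_epochs)
        self.reward_per_epoch = reward_per_epoch

    def end_epoch(self, successes: Dict[str, int]) -> Dict[str, float]:
        """Record this epoch's successful attestations and return payouts."""
        self.window.append(successes)
        totals: Dict[str, int] = {}
        for epoch in self.window:
            for node, count in epoch.items():
                totals[node] = totals.get(node, 0) + count
        grand_total = sum(totals.values())
        if grand_total == 0:
            return {}  # nothing completed in the window: no payout
        return {
            node: self.reward_per_epoch * count / grand_total
            for node, count in totals.items()
        }
```

Because payouts depend only on relative shares, a failure that affects every node equally leaves the distribution unchanged, which matches the "no penalty if everyone is missing attestations" property listed in the pros.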

Pros of this setup:

  • incentives for completing as many attestations as possible
  • no penalty if everyone is missing attestations (i.e. some sort of a systematic failure, like a client side bug, not individual node operator failure)
  • incentives to run attestation nodes (i.e. if there are only 10 validators running attestation nodes, per node rewards will be huge, so it will be extra motivation for others to run an attestation node too)

Cons of this setup:

  • There can be some sybil attack scenarios where a malicious node operator could spam a bunch of attestations and only complete their own, thus inflating only their own completion numbers to get a bigger share of the rewards.

I think some sort of sybil attack scenario will always exist (no matter the reward scheme), so we would have to depend on governance and kicking out/blacklisting bad actors manually.

zviadm avatar Jan 22 '21 11:01 zviadm

One of the questions that was brought up in the all-core devs call (by @zviadm I believe) was around the potential impact of improving the availability of attestation services. We'd like to know how much we'd need to improve the scores of individual attestation services to make a difference in user completion rate. I've put together this analysis here. The first graph shows the relationship between the failure rate of an Attestation Service (x-axis) and the percentage of users who abandon the flow (y-axis). Each dot is an Attestation Service.

[image: scatter plot of Attestation Service failure rate (x-axis) vs. percentage of users who abandon the flow (y-axis)]

As expected, we can see a fairly clear relationship between the two. The hypothesis is that getting paired with a poor AS introduces more friction, which leads to more user abandonment. Since a user gets paired with more than one AS, improving one AS node's availability should improve the abandonment rate for all other AS nodes.

In an attempt to isolate the impact of improving a poor performing issuer, I've also created the following graph:

[image: cumulative flow abandonment rate (y-axis) as issuers are added in order of increasing failure rate (x-axis)]

The graph shows a cumulative flow abandonment rate, starting from the best performing issuers (left) and progressively adding in the worst (right). For example, if we exclude every issuer with a failure rate* above 30% (0.2 on the x-axis since it’s bucketed on the first decimal) then the flow abandonment rate is 16% (.16 on the y-axis). As we move right on the x-axis, we include more and more worse performing issuers into the calculation; hence the user abandonment rate gets worse. Based on this chart, we can see that if we can improve the failure rate of all issuers to be below 30%, then we can expect overall user abandonment to drop from 20.8% (today) to 16.4%.

*Failure rate for an issuer is 1 - rate of attestation code completion. This data is aggregated over the time period (12/16/20-01/16/21)
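
The cumulative calculation described above might look roughly like the following. This is a sketch under assumptions: the input shape (failure rate, sessions, abandoned sessions per issuer) and the sample numbers are hypothetical and are not the data behind the graph, and the sketch skips the first-decimal bucketing mentioned in the comment.

```python
from typing import List, Tuple

# (failure_rate, sessions, abandoned_sessions) for one issuer
IssuerStats = Tuple[float, int, int]


def cumulative_abandonment(issuers: List[IssuerStats]) -> List[Tuple[float, float]]:
    """Sort issuers from best to worst failure rate and return, for each
    cutoff, the abandonment rate over all issuers at or below that cutoff."""
    points = []
    total_sessions = 0
    total_abandoned = 0
    for failure_rate, sessions, abandoned in sorted(issuers):
        total_sessions += sessions
        total_abandoned += abandoned
        if total_sessions > 0:  # skip cutoffs with no observed sessions yet
            points.append((failure_rate, total_abandoned / total_sessions))
    return points


# Hypothetical sample data, for illustration only.
sample = [(0.1, 100, 10), (0.5, 100, 40), (0.2, 200, 30)]
curve = cumulative_abandonment(sample)
for cutoff, rate in curve:
    print(f"failure rate <= {cutoff:.1f}: abandonment {rate:.3f}")
```

Reading the curve from left to right shows how much each additional group of worse-performing issuers drags the overall abandonment rate up.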

This shows that improving AS availability from what it is today is a worthwhile effort; however, we won't see significant gains in user completion until we can get all Attestation Services performing in the 20-30% success range (median is currently 22.5%).

codyborn avatar Jan 27 '21 05:01 codyborn

@zviadm @aaronmboyd I'm currently investigating the feasibility of splitting the validator/AS rewards without a hard-fork. If it needed a hard-fork, it'd likely not be on mainnet until Oct 2021.

codyborn avatar Jan 27 '21 21:01 codyborn

@zviadm @aaronmboyd I'm currently investigating the feasibility of splitting the validator/AS rewards without a hard-fork. If it needed a hard-fork, it'd likely not be on mainnet until Oct 2021.

Could you highlight where a hard-fork would be required? So far I think the proposed changes could be implemented in the core contracts layer, but I could be missing something.

nategraf avatar Jan 30 '21 00:01 nategraf

@nategraf The only reason to have to hardfork is if we're hitting the gas limit for the distributeEpochPaymentsFromSigner call and needed to bump it up before adding more complexity. The gas limit for this call is 1MM per call (each signer gets its own call and 1MM gas). We don't have the gas usage instrumented yet in the blockchain client, however from the protocol unit tests running against ganache, I can see that the gas usage is ~ 100k, giving us sufficient space to work with.

codyborn avatar Feb 01 '21 22:02 codyborn

Hi - we (Polychain) are in favor of the direction this conversation has moved and support decoupling incentives for validators and attestation services. Thank you to those who have contributed to this discussion!

In particular, the analysis on attestation service performance and user abandonment is very informative because it both confirms the importance of improving attestation rates and helps set a target for attestation rates the protocol could incentivize. We're looking forward to seeing where this goes.

mikereinhart avatar Feb 02 '21 16:02 mikereinhart

Just sent out a big update to the Attestation Service Incentives design. Appreciate all of the feedback that went into this. @zviadm a lot of this is riffing on your ideas and if you're open to it, I'd love to make you co-author. https://github.com/celo-org/celo-proposals/pull/161/files?short_path=8d43ce6#diff-8d43ce6ce22e7ebe262974018cb09de0113944f0a54d2af3af073a1183d689d9

codyborn avatar Feb 10 '21 05:02 codyborn

Just sent out a big update to the Attestation Service Incentives design. Appreciate all of the feedback that went into this. @zviadm a lot of this is riffing on your ideas and if you're open to it, I'd love to make you co-author. https://github.com/celo-org/celo-proposals/pull/161/files?short_path=8d43ce6#diff-8d43ce6ce22e7ebe262974018cb09de0113944f0a54d2af3af073a1183d689d9

happy to help. I just skimmed through the updated doc, and got two quick notes:

  • authorizeAttestationSigner cannot be called with address(0) because you need to have ProofOfPossession to authorize an address. This is a feature request though: in general, there should be a way to "clear signer" without having to create a new dummy key to authorize.
  • How difficult would it be code-wise to adjust the payout to be: AttestationServiceRewardPercentage * ValidatorPayout * "N of elected validators" / "N of attestation serving validators", instead of just: AttestationServiceRewardPercentage * ValidatorPayout? If it is not too difficult to add N of elected validators / N of attestation serving validators, that would make the payout more dynamic right away and solve the issue of potentially not having enough validators who run attestations.
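
The two payout variants in the second note can be compared side by side. This is a sketch only: the function and parameter names are hypothetical, the example numbers are made up, and returning zero when no validator serves attestations is an assumption.

```python
def attestation_payout(
    validator_payout: float,
    reward_percentage: float,
    n_elected: int,
    n_attesting: int,
    dynamic: bool = True,
) -> float:
    """Per-validator attestation reward for one epoch.

    Static:  reward_percentage * validator_payout
    Dynamic: the static amount scaled by n_elected / n_attesting, so the
             total pool is shared among the validators that serve attestations.
    """
    base = reward_percentage * validator_payout
    if not dynamic:
        return base
    if n_attesting == 0:
        return 0.0  # assumption: nobody to pay
    return base * n_elected / n_attesting


# Hypothetical numbers: payout of 200 per epoch, 20% attestation share.
static = attestation_payout(200.0, 0.2, 100, 80, dynamic=False)
dynamic = attestation_payout(200.0, 0.2, 100, 80)
print(static, dynamic)
```

In the dynamic variant the total paid out (per-validator payout times the number of attestation-serving validators) stays constant, so each remaining operator earns more as fewer validators serve attestations.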

zviadm avatar Feb 10 '21 20:02 zviadm

Good catch on the PoP. I just realized we already have a method removeAttestationSigner() which sets the signer to address(0): https://github.com/celo-org/celo-monorepo/blob/d07b9267115255d03174a218d31dee8d29b473b1/packages/protocol/contracts/common/Accounts.sol#L317

My main hesitancy of automatically increasing the reward potential when less AS are running is that it provides a direct incentive to bring down the scores of others due to the deregistration feature. Once we have better ways to detect legitimate requests on-chain, then I think this dynamic payout makes sense. We can improve the signal by having Komenci include on-chain attestations when a user passes the reCAPTCHA and device checks in addition to only affecting an issuer's score when a user completed all attestations except for the issuer in question.

codyborn avatar Feb 10 '21 22:02 codyborn

cc @mcortesi who had some ideas around an alternative design

codyborn avatar Feb 17 '21 21:02 codyborn

It seems like we've reached a good point with CIP32 thanks to all of the feedback. You can find the latest version of the spec published here. Does anyone have concerns with moving forward with the implementation and subsequent CGPs? cc @zviadm @nategraf @aaronmboyd @mikereinhart @aslawson @asaj @nambrot

codyborn avatar Feb 23 '21 18:02 codyborn

Hi @codyborn it still needs to be discussed in the All-Core Dev Call and other community members would need to weigh in before we move it from Draft stage to Last Call (review period). Once it finishes the review period, then it's accepted as a standard and then implementation can proceed. Reason for this is because this is a distributed community and folks would need an opportunity to weigh in given this impacts rewards, etc. I think it's a good proposal fwiw.

YazzyYaz avatar Feb 23 '21 20:02 YazzyYaz

Makes sense @YazzyYaz. I see we have a Core Devs 5 this Thursday. Do we have an agenda yet?

codyborn avatar Feb 23 '21 20:02 codyborn

Yeah your CIP is on it :) https://github.com/celo-org/celo-proposals/issues/164

Good call out though, I need to add the agenda to the calendar.

YazzyYaz avatar Feb 23 '21 21:02 YazzyYaz

I feel that we should study some metrics to calculate what the actual rewards breakdown would be, i.e. the 20% (attestation) + 80% (validation) breakdown. I feel it lies in a matrix around

  • How much work is involved to validate vs doing attestation
  • How important is attestation for Celo and Celo users

I think in the early days attestation will be a crucial part to onboard users as we want to have a great onboarding experience but after a few years, it might change.

devme25 avatar Feb 25 '21 23:02 devme25

I left this comment on the PR before, but I am not sure if anyone saw it and would like a bit of discussion on the idea.

I have a gut feeling that there are too many parameters in the proposed design. It's just an intuition, but I figured I'd propose an alternative that might be a bit of a simpler direction.

With the target completion rate, I am concerned that the steady-state completion rate may be either lower or higher for well-performing nodes. In the lower case, which would be true if we see an uptick in spurious requests or requests from a country where we can't actually reliably deliver SMS, the result will be that validators will see their pay reduced without recourse. In the higher case, the system loses any pressure to continue raising completion rates until a governance proposal is passed, which may take time.

As for the global circuit-breaker, we lose the incentive for nodes to be more resilient than average such as by taking action to setup failover SMS providers. It's kind of the opposite of how Eth2 slashing works, where the penalties are designed to disincentivize coordinated failure. We do need to have the system not overly punish validators for events outside their control, such as a natural disaster in a given region, but we should create an incentive for validators to overcome those difficulties where possible. (In the extreme case, a validator could decide to shut down its attestations API whenever the circuit breaker is on, as it will no longer have an ill effect on their rewards.)

My alternative proposal is to calculate the completion rate on each epoch, then make the threshold for full rewards be a fixed difference from average. For example, if the completion rate for an epoch is 70/100 and the threshold difference is 0.1, any validators with more than a 60% completion rate will get full rewards. Below that I would suggest the rewards be a linear function of the completion rate in relation to the threshold, as is currently proposed. In this example, a validator with 30% completion for the epoch will receive 50% of the max rewards. When the threshold difference is greater than 0, it is possible for all validators to receive full rewards, but there is still pressure to push their individual completion rate up to ensure it is safely above the threshold. It would also be possible to set the difference below 0 to increase the pressure to raise completion rates, but then it would be impossible for all validators to receive full rewards.

In the scenario of an adverse external event (e.g. a region's only telecom failing), overall completion rates may drop, but the rewards metric will have no delay in adjusting its target. Imagining that an outage or bug causes an overall 30% completion rate, the threshold would be 20%, allowing validators to receive full rewards while providing an incentive to be as resilient as possible. It also adjusts well in cases where the overall completion rate goes up, as an overall completion rate of 90% would set an 80% threshold.
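
The relative-threshold rule from the two paragraphs above can be written down as a small function. This is a sketch, not the proposal's specification: the function names are hypothetical, and treating a non-positive threshold as "everyone gets full rewards" is an assumption.

```python
def full_reward_threshold(average_rate: float, threshold_difference: float) -> float:
    """Completion rate needed for full rewards: a fixed difference below
    the epoch's average completion rate."""
    return average_rate - threshold_difference


def reward_fraction(
    completion_rate: float, average_rate: float, threshold_difference: float
) -> float:
    """Fraction of max rewards: 1.0 at or above the threshold, linear below it."""
    threshold = full_reward_threshold(average_rate, threshold_difference)
    if threshold <= 0 or completion_rate >= threshold:
        return 1.0
    return completion_rate / threshold


# Examples from the comment: average 70%, difference 0.1 -> threshold 60%.
print(round(reward_fraction(0.30, 0.70, 0.10), 3))  # 30% completion -> half rewards
print(reward_fraction(0.65, 0.70, 0.10))            # above threshold -> full rewards
```

Because the threshold is recomputed from each epoch's average, an outage that drags the average down to 30% moves the threshold to 20% in the same epoch, with no governance action needed.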

Many of these points come more from a place of intuition than any drawn-out analysis, so I'd love to have some discussion of the pros and cons.

nategraf avatar Feb 26 '21 18:02 nategraf

@nategraf with regard to leaving comments on PRs, I try to recommend that folks just comment on the Issue ticket, since PRs get merged all the time and comments get lost (or rather are harder to find again)

YazzyYaz avatar Mar 01 '21 18:03 YazzyYaz

Hey @nategraf, apologies for missing the previous comment in the PR. I definitely think this idea is worth discussing, especially in light of the recent Twilio and Valora 1.11 failures. What I like about this proposal is that the dynamic threshold automatically accounts for failures outside of the AS operator's control. This can obviate the need for the baseline measurement and fallback reward mechanism. The concern that I have with this approach is that it introduces some incentive to negatively impact other AS's score. By making incomplete requests to other AS and completing them when I randomly hit my own, I can drop the avg for each epoch and guarantee the max payout of my AS. Given how easy this attack is to perform, I was hesitant to add any "competitive" incentive. That being said, I think this problem is still present with the existing proposal, although it's a little harder to achieve since you can't push others' score down to lift your score up.

As for the global circuit-breaker, we lose the incentive for nodes to be more resilient than average such as by taking action to setup failover SMS providers.

In this PR, I'm proposing changing the fallback mechanism slightly, which I believe will address this concern. In the event of the fallback rewards, rather than all validators receiving a reduced reward, it just sets a cap on the reward loss. A validator with a resilient (and uncorrelated) setup can still achieve 100% rewards.

codyborn avatar Mar 02 '21 18:03 codyborn

Hey @nategraf, apologies for missing the previous comment in the PR. I definitely think this idea is worth discussing, especially in light of the recent Twilio and Valora 1.11 failures. What I like about this proposal is that the dynamic threshold automatically accounts for failures outside of the AS operator's control. This can obviate the need for the baseline measurement and fallback reward mechanism. The concern that I have with this approach is that it introduces some incentive to negatively impact other AS's score. By making incomplete requests to other AS and completing them when I randomly hit my own, I can drop the avg for each epoch and guarantee the max payout of my AS. Given how easy this attack is to perform, I was hesitant to add any "competitive" incentive. That being said, I think this problem is still present with the existing proposal, although it's a little harder to achieve since you can't push others' score down to lift your score up.

As for the global circuit-breaker, we lose the incentive for nodes to be more resilient than average such as by taking action to setup failover SMS providers.

In this PR, I'm proposing changing the fallback mechanism slightly, which I believe will address this concern. In the event of the fallback rewards, rather than all validators receiving a reduced reward, it just sets a cap on the reward loss. A validator with a resilient (and uncorrelated) setup can still achieve 100% rewards.

As I mentioned in my original comment, I also still prefer dynamic rewards based on completion rate out of total completions. I personally don't think malicious behavior is something to really worry about. First, there is economic cost to malicious behavior because you have to pay for attestation fees, so it isn't just free-for-all. Second, there is huge reputation and potential Governance based action cost to doing something like this.

My personal preference would still be for simplest solution for dynamic rewards. I.e. something as simple as:

  • Have total reward pool per period.
  • Calculate total completed attestations across all validators for that period.
  • Per validator rewards would be: 'validator completion / total completion * total reward pool'.

Period can be 1 epoch for code simplicity. But ideally, it would be 10 or 15 epochs. That would be better conceptually, but probably not worth the extra complexity.

zviadm avatar Mar 03 '21 14:03 zviadm

First, there is economic cost to malicious behavior because you have to pay for attestation fees, so it isn't just free-for-all.

In the case of running Valora on an emulator, the cost is a reCAPTCHA (around $.002 to solve based on online abusive marketplaces). We have work slated for this milestone to add device checks, which should increase the cost of an attack. The minimum cost will always be $.05/attestation if an attacker decides not to use Komenci. Let me do some attack EV analysis on both proposals (@nategraf and @zviadm) and we can compare.

Second, there is huge reputation and potential Governance based action cost to doing something like this.

I agree that the cost is a significant deterrent. My concern is that it will be difficult to detect this abuse and even harder to point blame on the attacker.

codyborn avatar Mar 03 '21 20:03 codyborn