
NEP-509: Stateless validation stage 0

Open · walnut-the-cat opened this issue 1 year ago • 2 comments

WIP

walnut-the-cat avatar Sep 19 '23 22:09 walnut-the-cat

Your Render PR Server URL is https://nomicon-pr-509.onrender.com.

Follow its progress at https://dashboard.render.com/static/srv-ck51l1o21fec73aapqgg.

render[bot] avatar Sep 19 '23 22:09 render[bot]

Hi @walnut-the-cat – thank you for starting this proposal. As the moderator, I labeled this PR as "Needs author revision" because we assume you are still working on it since you submitted it in "Draft" mode.

Please ping the @near/nep-moderators once you are ready for us to review it. We will review it again in early January, unless we hear from you sooner. We typically close NEPs that are inactive for more than two months, so please let us know if you need more time.

frol avatar Nov 01 '23 19:11 frol

As a working group member, I'd like to nominate @mfornet and @birchmd as SME reviewers for this NEP.

bowenwang1996 avatar Jun 17 '24 01:06 bowenwang1996

@mfornet, it would be great if you could review the NEP :)

walnut-the-cat avatar Jun 24 '24 19:06 walnut-the-cat

Very nice writeup, it was quite easy to follow and understand. I lean towards approving this NEP.

A couple of comments:

security

You write "With this number of mandates per shard and 6 shards, we predict the protocol to be secure for 40 years at 90% confidence."

I'd like to get more details about this, and about shorter time frames:

  • what is the risk within the next 5 years?
  • how does it change when the number of shards increases?
  • how much stake/NEAR does an attacker need?

This is the main goal of doing stateless validation, right? So that we can have 50 shards, for example.

  • Does this mean that we completely abandon the idea of slashing?

performance

  • Do we have info on how big the state witnesses are going to be on average? (based on current traffic patterns)

  • How much increased latency in block production do we expect? (Before: chunk producer -> block producer. Now we're adding a third group in the middle, which will have lower stake on average, so its responses might be slower.)

  • For Reed-Solomon erasure encoding: do we still plan to send it to all the block producers (for all the shards)?

mm-near avatar Jun 26 '24 18:06 mm-near

@mm-near The "40 years at 90% confidence" calculation was done by me.

It assumes that the attacker has just barely less than 1/3 of the total stake (so they cannot outright take over the protocol), which is about 197 million $NEAR as of today.

The calculation determines the probability of a shard assignment (recall that stake is converted to "mandates" and these are randomly assigned to shards) in which at least one shard has 2/3 of its assigned stake controlled by the attacker. In that case the attacker would be free to push an invalid state transition because it could sign the invalid state witness itself. With 68 mandates per shard and 6 shards total this probability is 8.6e-10.

Then we assume the shard assignments are independent so that we can model it as a Bernoulli process and see how many "trials" it would take before we have a "success" (i.e. how many random shard assignments are there before the attacker obtains a 2/3 majority in one shard). The probability of having m "failures" in a row in a Bernoulli process is (1-p)^m and we want that to happen with 90% confidence (a somewhat arbitrary value chosen by me), so we can have m = ln(0.9)/ln(1-p) trials. This works out to be around 122 million trials.

Now that we know the number of trials we can convert it into a time. At 1 trial per second that is almost 4 years, but at the time Bowen was suggesting to shuffle less often than every block. At 1 trial per 10 seconds we get almost 40 years, which is the number I reported.

We can also do this calculation the other way though. If we take the 5-year timeline you propose, then we can convert that into a number of trials. Let's assume one trial per second, since I think the current implementation does shuffle validators every block. That gives around 157 million trials, and we want to know, in our Bernoulli process, the probability of having at least 1 success within that many trials. This probability is 1 minus the probability that all those trials fail in a row, so 1 - (1-p)^N. This works out to be around 12.7%. So if someone controlled 197 million staked $NEAR for five years, there is a 12.7% chance that they would have the opportunity to push an invalid state transition. If we instead assume only 1 trial every 10 seconds, this probability drops to around 1.3%.

If you keep the number of mandates per shard the same, then this whole calculation does not change much as you increase the number of shards, because the theory says that the dependency on the number of shards is not very strong once you have more than a few. So the base probability of 8.6e-10 should stay close to the same for any number of shards. But note that increasing the number of shards while keeping the number of mandates per shard the same means increasing the total number of mandates.
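The arithmetic above can be checked with a short script. The base probability p = 8.6e-10 is taken from the comment as given; the trial rates (one shard assignment per second, or per 10 seconds) are the assumptions discussed.

```python
import math

# Probability (from the comment above) that a single random shard
# assignment gives the attacker a 2/3 majority in at least one shard,
# with 68 mandates per shard and 6 shards total.
p = 8.6e-10

# Number of consecutive "failures" (safe assignments) we can expect
# with 90% confidence: solve (1 - p)^m = 0.9 for m.
m = math.log(0.9) / math.log(1 - p)
print(f"trials at 90% confidence: {m / 1e6:.1f} million")  # ~122.5 million

# Convert trials to time at 1 trial per 10 seconds.
years = m * 10 / (365.25 * 24 * 3600)
print(f"safe horizon at 1 trial per 10s: {years:.0f} years")  # ~39 years

# The reverse direction: probability of at least one "success"
# (attacker majority in some shard) within 5 years at 1 trial/second.
n = 5 * 365.25 * 24 * 3600
risk = 1 - (1 - p) ** n
print(f"5-year risk at 1 trial/s: {risk:.1%}")  # ~12.7%
```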

birchmd avatar Jun 26 '24 19:06 birchmd

NEP Status (Updated by NEP Moderators)

Status: VOTING

SME reviews:

Protocol Work Group voting indications (❔ | 👍 | 👎):

  • ❔ @bowenwang1996
  • 👍 @birchmd
  • 👍 @mfornet
  • 👍 @mm-near

victorchimakanu avatar Jun 27 '24 15:06 victorchimakanu

@mm-near

Do we have info on how big the state witnesses are going to be on average? (based on current traffic patterns)

Please find the metrics based on the current mainnet traffic for a window of 12 hours.

max witness size

Max witness size affects chunk validation latency. (screenshot: max witness size over the 12-hour window)

avg witness size

Avg witness size determines additional chunk validation network usage. (screenshot: average witness size over the 12-hour window)

pugachAG avatar Jul 02 '24 16:07 pugachAG

@mm-near the latency you mentioned matches the existing one.

Before: the BP sends a block quickly on receiving chunks, but the block is validated only after other block producers apply all its chunks; that was their only way to validate the chunks in a block. So the next block is produced only after the previous chunks have been applied.

After: the BP additionally has to wait for endorsements from CVs. But this is equivalent to waiting for the previous chunks to be applied; the work is just performed by CVs based on the state witness now. After that, the block is quickly validated by verifying the endorsement signatures.

Also, BPs and CPs are also CVs, so the stake involved in chunk validation remains large. Memtrie is much faster than the disk trie, which compensates for the network latency of sending state witnesses and endorsements.


UPD: the actual additional latency is introduced on chunk producer side: https://github.com/near/nearcore/issues/10584

In short: to produce chunk N, the CP must apply chunk N-1, for which the BP must produce block N-1, for which CVs must validate (= apply) chunk N-1. So chunk N-1 is applied twice. But again, we expect the speedup in chunk application to outweigh that.
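The dependency chain above can be put into a toy latency model. All numbers below are made up purely for illustration (real figures depend on hardware and traffic); the point is only that applying a chunk twice with a fast in-memory trie can still beat applying it once with a slow disk trie.

```python
# Hypothetical per-step latencies in milliseconds (illustrative only).
disk_trie_apply_ms = 200  # chunk application with the old disk trie
memtrie_apply_ms = 40     # chunk application with the in-memory trie
network_hop_ms = 50       # one hop: state witness or endorsement delivery

# Before stateless validation: block producers apply each chunk once
# (disk trie) before the next block can be produced.
before = disk_trie_apply_ms

# After: chunk N-1 is applied twice (by CVs to endorse it, and by the
# CP to build chunk N), plus two network hops for the witness and the
# endorsements.
after = 2 * memtrie_apply_ms + 2 * network_hop_ms

print(f"before: {before} ms, after: {after} ms")  # before: 200 ms, after: 180 ms
```

With these made-up numbers, the extra application and network hops are still cheaper than a single disk-trie application, matching the expectation stated above.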


Side notes:

  • If the BP tracks a shard, it will apply chunks for that shard, but this doesn't block receiving other blocks.
  • The latency of a user waiting for a transaction outcome shouldn't change.

Let's say only one shard is touched by a transaction. To get the outcome, we query an RPC node that tracks the touched shard. If the RPC node tracks the shard, it applies chunks from blocks immediately without waiting for endorsements, because chunk application is deterministic. Chunks are validated on chain by endorsements in the next block with a chunk, but if the user is optimistic, they can just rely on the RPC node's response.

Longarithm avatar Jul 02 '24 19:07 Longarithm

@mm-near

For Reed-Solomon erasure encoding: do we still plan to send it to all the block producers (for all the shards)?

The main purpose of the Reed-Solomon erasure encoding for the state witness is to reduce the load on the chunk producer when distributing the state witness. The recipients of the state witness are all the chunk validators, and they are the ones who participate in the partial witness forwarding, not the block producers.

This way we don't put too much network load on the block producers, and the network load is localized to the chunk validators. Nodes that hold a higher number of mandates are validators for multiple shards.
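The bandwidth saving can be sketched with back-of-envelope arithmetic. All parameters below (witness size, validator count, code rate) are hypothetical placeholders, not the actual values used in nearcore: with a k-of-n erasure code, the chunk producer uploads one ~S/k part to each validator instead of the full witness S to everyone, and validators forward parts among themselves to reconstruct.

```python
# Hypothetical parameters for illustration only.
witness_mb = 16   # state witness size S
validators = 30   # chunk validators for the shard (n parts, one each)
data_parts = 20   # parts needed to reconstruct (k of an n = 30 code)

# Naive: send the full witness to every chunk validator.
naive_upload = witness_mb * validators                  # S * n

# Erasure-coded: send one ~S/k part to each validator; the remaining
# distribution (part forwarding) is spread across the validator set.
coded_upload = (witness_mb / data_parts) * validators   # (S / k) * n

print(f"naive: {naive_upload} MB, erasure-coded: {coded_upload} MB")
# naive: 480 MB, erasure-coded: 24.0 MB
```

So the chunk producer's upload shrinks by a factor of k (here 20x), at the cost of extra validator-to-validator traffic and n - k parts of redundancy for fault tolerance.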

shreyan-gupta avatar Jul 04 '24 19:07 shreyan-gupta

Answering @mfornet's comments related to chunk validators:

Yeah, this is a known problem. We discussed it a couple of times. One idea was to introduce "honeypot state witnesses", whose goal would be to verify that state witnesses can get invalidated, and to penalise validators for blind approvals.

However, the counterarguments are that

  • we have other places in the consensus where blind approval is not penalised at all - e.g. nothing prevents block validators from endorsing all blocks
  • for small chunk validators the effect is negligible; for bigger stake behind blind approvals, the situation is effectively the same as many validators colluding, which we can't control anyway.

So any of these solutions would introduce additional complexity (which is already very substantial), and the benefit was never clear.

Longarithm avatar Jul 08 '24 19:07 Longarithm

Thank you to everyone who attended the Protocol Work Group meeting! The working group members reviewed the NEP and reached the following consensus:

Status: Approved (Meeting Recording: https://youtu.be/058BZEyXzgU)

  • πŸ‘ @mfornet https://github.com/near/NEPs/pull/509#pullrequestreview-2160754817
  • πŸ‘ @mm-near https://github.com/near/NEPs/pull/509#issuecomment-2192429086
  • πŸ‘ @birchmd https://github.com/near/NEPs/pull/509#pullrequestreview-2122960451

@walnut-the-cat Thank you for authoring this NEP!

@birchmd @mfornet Thank you for the review!

flmel avatar Jul 26 '24 14:07 flmel