[CIP 0017] Integrate Plumo proofs into the lightest sync client
WIP
Some high level questions I'd like for us to shake out here are:
- What is the data structure of a proof?
- What fields do we need to verify a proof?
- What fields do we need to index proofs for retrieval?
- Is a proof unique under a particular identifier (e.g. start and end epoch)?
- What does a typical proof distribution flow look like among full nodes? (Including examples may help us shake out edge cases and give a better mental picture)
- How big are the proofs? What is the size of a request for the proof IDs versus just returning the proofs themselves?
- What does a typical proof distribution flow look like among light clients?
- What do we do with bad proofs?
- If a light client receives a bad proof from the server, what does the client do?
- If a full node receives a bad proof from a peer, what does it do?
- Is there any chance that a maliciously constructed proof could cause issues (e.g. DoS, indexing bad proofs, etc)?
- How many proofs might we expect to handle?
- Can we establish an upper bound on the number of proofs?
- Do we need to consider pruning duplicate proofs?
- How can we enable ourselves to extend this protocol in the future?
- E.g. can we make the message flexible to include things like bloom filters as needed in the future?
- Can we support improvements to the Plumo circuit? (e.g. proof/circuit/CRS versioning)
Responded in-line. Great questions - they got me thinking about a lot of the more intricate details.
What is the data structure of a proof?
A proof in celo-blockchain will just be the serialized proof, so a string of bytes. This will also likely be accompanied by the relevant epochs and the validator set diffs.
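As a rough illustration, the stored object might look like the following sketch (names and the diff encoding are illustrative guesses, not final celo-blockchain types):

```go
// Illustrative only: a sketch of the stored proof object described above.
type PlumoProof struct {
	Proof      []byte // serialized proof bytes
	FirstEpoch uint64 // first epoch the proof covers
	LastEpoch  uint64 // last epoch the proof covers
	// Validator set diffs (or full key sets) accompanying the proof;
	// the exact encoding is discussed further down this thread.
	ValidatorDiffs [][]byte
}
```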
What fields do we need to verify a proof?
- Verification key: []byte (global - ideally I’d like to store this in a SC to allow for governable upgrades, but it appears to be quite long, getting an estimate)
- Proof : []byte
- firstEpoch/lastEpoch : { epochIndex, maxNonSigners, validatorSet }
- maxNonSigners is the difference between the validator set size and the minimum quorum size
- For reference, this is the Go fn to call into the snark verifier: https://github.com/celo-org/celo-bls-go/blob/master/snark/snark.go
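Putting those fields together, a hedged sketch of the call shape (the `EpochBlock` fields mirror the triple above; the actual entry point lives in the linked snark.go and may differ, so treat this as illustrative only):

```go
// Sketch of the verification inputs; field names mirror the list above.
type EpochBlock struct {
	EpochIndex    uint64
	MaxNonSigners uint32   // validator set size minus the minimum quorum size
	ValidatorSet  [][]byte // compressed BLS public keys
}

// verifyPlumoProof is a hypothetical wrapper; the real verifier is in
// celo-bls-go's snark package (linked above).
func verifyPlumoProof(vk, proof []byte, first, last EpochBlock) error {
	// Delegate to the snark verifier here; return a non-nil error when the
	// proof fails to verify against (first, last) under vk.
	panic("sketch only: call into celo-bls-go's snark verifier")
}
```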
What fields do we need to index proofs for retrieval?
Currently I’m thinking just {firstEpoch, lastEpoch} to retrieve, but you also need the above params to verify.
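Concretely, something like this (illustrative, reusing the `PlumoProof` sketch above):

```go
// Illustrative index: proofs keyed by the epoch span they cover. Note that
// uniqueness per key is not guaranteed, as discussed just below.
type proofKey struct {
	FirstEpoch uint64
	LastEpoch  uint64
}

var proofDb map[proofKey]PlumoProof
```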
Is a proof unique under a particular identifier (e.g. start and end epoch)?
This is not guaranteed, no
What does a typical proof distribution flow look like among full nodes? (Including examples may help us shake out edge cases and give a better mental picture)
- This is currently in flux, but here is my most up to date plan based on some changes incoming to the CIP:
- The proving service adds a proof to its full node via RPC; the node verifies the proof and accepts it if it verifies
- The proof is then added to this node's `proofDb`
- This node then broadcasts this proof to all its peers
- This needs more thought and should be modeled almost exactly how blocks are propagated.
- The verification/storage/broadcast process then repeats on each peer
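A hedged sketch of that per-node flow, reusing the sketches above (`verify` and `broadcastToPeers` are placeholders for logic that should mirror how blocks are verified and propagated):

```go
// Hypothetical node state and handler for a new proof, whether it arrives
// via the proving service's RPC or via gossip from a peer.
type node struct {
	proofDb map[proofKey]PlumoProof
}

func (n *node) verify(p PlumoProof) error     { panic("sketch: snark verification") }
func (n *node) broadcastToPeers(p PlumoProof) { panic("sketch: proof gossip") }

func (n *node) handleNewProof(p PlumoProof) error {
	key := proofKey{p.FirstEpoch, p.LastEpoch}
	if _, known := n.proofDb[key]; known {
		return nil // already have a proof for this span; nothing to do
	}
	if err := n.verify(p); err != nil {
		return err // bad proof: never stored, never re-broadcast (see below)
	}
	n.proofDb[key] = p
	n.broadcastToPeers(p) // modeled on how blocks are propagated
	return nil
}
```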
How big are the proofs? What is the size of a request for the proof IDs versus just returning the proofs themselves?
- Proofs are a couple of kBs: for Groth16 estimated between 364 bytes in the best case (no validator set differences) and around 10kB (100 validator set differences)
- The requests are still being fleshed out right now, but in my current idea - where you would send a message that includes all the `PlumoProofEpochs` spans that you have, and receive a response of all that you're missing - the request would be the size of the base message + 16 bytes for each proof you have.
- This seems better than receiving multiple kBs per proof for all proofs
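The 16 bytes per proof come from the two 8-byte epoch indices in each span. Roughly (message names are placeholders, not a final wire format):

```go
// Placeholder message shapes for the exchange described above.
type PlumoProofEpochs struct {
	FirstEpoch uint64 // 8 bytes
	LastEpoch  uint64 // 8 bytes => 16 bytes per proof the requester holds
}

type getPlumoProofsMsg struct {
	Known []PlumoProofEpochs // spans the requester already has
}

type plumoProofsMsg struct {
	Proofs []PlumoProof // proofs the responder has that the requester lacks
}
```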
What does a typical proof distribution flow look like among light clients?
TBD, currently no plan for distribution among light clients, only from the LES server to its client
What do we do with bad proofs?
TBD
If a light client receives a bad proof from the server, what does the client do?
The proof is dropped; unsure how we can signal the server, but we should do that too.
If a full node receives a bad proof from a peer, what does it do?
TBD, but likely signals the peer or drops the peer.
Is there any chance that a maliciously constructed proof could cause issues (e.g. DoS, indexing bad proofs, etc)?
Assuming we don’t introduce any bugs, no. All proofs should be verified before inserting and before any further broadcasting
How many proofs might we expect to handle? Can we establish an upper bound on the number of proofs?
Not really, but if we adopt @kevjue 's suggestion that we prune redundant proofs then we could theoretically upper bound the proofs as `totalEpochs / epochsPerProof`, where `epochsPerProof` has been estimated ~100, and of course `totalEpochs` increases monotonically
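(For example, after 10,000 epochs with `epochsPerProof` = 100, the bound works out to 10,000 / 100 = 100 stored proofs.)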
Do we need to consider pruning duplicate proofs?
I think so, if we can implement it quickly. It isn’t a requirement, but is a worthy optimization and will prevent some nasty logic when trying to find a proof (verifying that you didn’t miss a faster proof path)
How can we enable ourselves to extend this protocol in the future? E.g. can we make the message flexible to include things like bloom filters as needed in the future?
- TBD, will discuss on our sync.
- We should discuss the bloom filter specifically since I'm not sure we can actually utilize one here since the key space is constantly increasing.
Can we support improvements to the Plumo circuit? (e.g. proof/circuit/CRS versioning)
- Yes, this is also something we should discuss. My plan to do this would be to publish the hash of the current VK to `blockchain_parameters` and verify against that, although I’m not sure how regularly we can/should do that.
- We could also extend this logic to do something like storing a dictionary, i.e. `{ 0-1000: VK_0, 1000: VK_1 }`, such that the original proofs are still valid and usable. This introduces more complication on the verification side, though.
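Sketched out (purely illustrative; the hashes published on-chain would commit to these keys):

```go
// Illustrative VK versioning: each epoch range maps to the verification key
// active for proofs generated over that range.
type vkRange struct {
	FromEpoch uint64 // inclusive
	ToEpoch   uint64 // exclusive; 0 used here to mean "current"
}

var vk0, vk1 []byte // candidate keys; on-chain hashes would commit to these

var verificationKeys = map[vkRange][]byte{
	{0, 1000}: vk0, // original CRS: proofs made under it stay verifiable
	{1000, 0}: vk1, // post-upgrade CRS
}
```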
Responding "double in-line" to add some more comments.
What fields do we need to verify a proof?
* Verification key: []byte (global - ideally I’d like to store this in a SC to allow for governable upgrades, but it appears to be quite long, getting an estimate)
The verification key is quite short, especially because the actual instance that's verified is a hash of the initial and final epochs. The prover key is the one that's very long.
* Proof : []byte
* firstEpoch/lastEpoch : { epochIndex, maxNonSigners, validatorSet }
* maxNonSigners is the difference between the validator set size and the minimum quorum size
You don't need this to verify the proof. If the proof verifies it implies this number is < 1/3 of validators.
Is a proof unique under a particular identifier (e.g. start and end epoch)?
This is not guaranteed, no
Zero-knowledge proofs and interactive proofs always use randomness, but we're using a non-zk version of the non-interactive Groth16 proof system, which results in unique proofs given the same initial and final epoch (short of the existence of a fork that merges back at some later point in the epoch message chain).
How big are the proofs? What is the size of a request for the proof IDs versus just returning the proofs themselves?
* Proofs are a couple of kBs: for Groth16 estimated between 364 bytes in the best case (no validator set differences) and around 10kB (100 validator set differences)
Proofs for Groth16 are always the same size over a given pairing-friendly curve: two group one and one group two elements. Our proofs are over BW6-761, so they will be 383B.
The client will already have in its state the validator keys for the initial epoch and needs the keys for the final epoch. Keys are one element in group 2 of BLS12-377, so they are 95B each. I think it just makes sense to send the 100 keys instead of the diff because a maximum saving of 9.5kB is not worth adding the logic of the diff.
How many proofs might we expect to handle? Can we establish an upper bound on the number of proofs?
Not really, but if we adopt @kevjue 's suggestion that we prune redundant proofs then we could theoretically upper bound the proofs as `totalEpochs / epochsPerProof`, where `epochsPerProof` has been estimated ~100, and of course `totalEpochs` increases monotonically
Do we need to consider pruning duplicate proofs?
I think so, if we can implement it quickly. It isn’t a requirement, but is a worthy optimization and will prevent some nasty logic when trying to find a proof (verifying that you didn’t miss a faster proof path)
How can we enable ourselves to extend this protocol in the future? E.g. can we make the message flexible to include things like bloom filters as needed in the future?
* TBD, will discuss on our sync.
* We should discuss the bloom filter specifically since I'm not sure we can actually utilize one here since the key space is constantly increasing.
Can we support improvements to the Plumo circuit? (e.g. proof/circuit/CRS versioning)
* Yes, this is also something we should discuss. My plan to do this would be to publish the hash of the current VK to `blockchain_parameters` and verify against that, although I’m not sure how regularly we can/should do that.
* We could also extend this logic to do something like storing a dictionary, i.e. `{ 0-1000: VK_0, 1000: VK_1 }`, such that the original proofs are still valid and usable. This introduces more complication on the verification side, though.
@kobigurk Can you explain whether it's possible, with the new circuit that allows "dummy epoch messages", to create a proof covering any number of epochs up to some limit? I think nodes should only need to store about 3 proofs per year passed, assuming this is the case and that limit is 120 as suggested elsewhere. Again, if this is the case, then during epoch 256 a node would just be storing proofs for (0, 120), (120, 240), (240, 256).
There is the case to consider where a client goes offline for a long time right at 180 and comes back at 256. They shouldn't have to download and verify 60 headers before they receive the next SNARK, so storing more incremental proofs will help in this scenario.
One thing missing from discussion here is batch verification. Batch verification refers to using a modified version of a verification algorithm to verify multiple SNARKs/ signatures/ etc. together faster than they could be independently verified using the standard verification algorithm. If we ensure the proofs we store line up (e.g., (0, 120), (120, 240), (240, 256)), then in our case there is another optimization we can use: following the example above, a client can receive 3 proofs pi_1,pi_2,pi_3 for the 3 time periods above, but they only need to receive the 9.5kB of final validator keys at epoch 256, and a hash-to-field (of size 2 field elements over BW6-761 = 96B) for both the validator keys at 120 and 240. So since they're small, the hash-to-field of the initial epoch of each proof should be stored as well so they don't need to be recomputed.
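For reference, the standard batching trick looks roughly like this (hedged sketch, not Plumo-specific; per-proof check on top, batched check below, with verifier-sampled random scalars r_i):

```latex
% Per-proof Groth16 check, for proof (A_i, B_i, C_i) with public-input term L(x_i):
%   e(A_i, B_i) = e(\alpha, \beta) \cdot e(L(x_i), \gamma) \cdot e(C_i, \delta)
% Batched over n proofs with verifier-sampled random scalars r_i:
\prod_{i=1}^{n} e(r_i A_i,\ B_i)
  = e\Big( \Big(\sum_{i=1}^{n} r_i\Big)\,\alpha,\ \beta \Big)
  \cdot e\Big( \sum_{i=1}^{n} r_i\, L(x_i),\ \gamma \Big)
  \cdot e\Big( \sum_{i=1}^{n} r_i\, C_i,\ \delta \Big)
% Total: n + 3 pairings, matching the cost estimate given further below.
```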
Just wanted to correct something: the honest non-zk-variant groth16 prover will produce the same proof for an instance x, but anyone can randomize any valid proof for x to create another valid proof. E.g., if (A, C, B) in G1^2 x G2 then so is (rA, C, r^(-1)B) for any nonzero r in Fp.
Replying to my own message, I just reviewed the code for the new circuit and it is indeed the case that you can now make a proof from A to B for all B in {A+1, A+2, ..., A+120}. So ideally the proofs a node would have saved at time 256 would be (1, 120), (120, 240), (240, 256). The node would also store the hash-to-field of the (index, validator_keys, max_non_signers) triple at times 120 and 240. (To clarify something from earlier, I was wrong about max_non_signers: it's part of the instance now rather than the witness, to account for epochs where the committee is "underfilled.") A bootstrapping client at time t=256 receives the 3 proofs, the hash-to-field of (index, validator_keys, max_non_signers) for t={120,240}, and the (index, validator_keys, max_non_signers) for t=256. The client hash-to-fields this final triple themselves (and has the hash-to-field of t=1 hardcoded as part of the genesis data), and then plugs the resulting 4 field elements and 3 proofs into the resulting Groth16 batch verification equation. (I may have misspoken earlier: the hash-to-field of each instance is 2 field elements, but the instance is two triples and each one is hashed individually, resulting in 1 field element each.)
As for the case of a client going offline for a while, I had the idea to simply "restart" them from the genesis block so they could quickly sync. The downside of this is that if a long range attack happens against blocks they've already validated, they will now be affected by it, whereas before they were safe. The upside is the logic around what proofs to send, pruning, etc. stays extremely simple: at time t you always send the same ceil(t/120) proofs, floor(t/120) partial instance hashes, and the current epoch's (index, validator_keys, max_non_signers). There will be some lag, maybe 30min, between when epochs roll over and when a new proof can be created for that day, and for that time you can just send the proof chain and associated data up to t-1 and then the full epoch message for t. As soon as the proof for t rolls in and you validate it, drop the proof for t-1 and its associated data unless (t-1) % 120 = 0, in which case save it along with the hash-to-field of (index, validator_keys, max_non_signers) for that day. I just talked with @kobigurk and apparently he and some other people were discussing and leaning towards this solution as well for clients who go offline for too long to catch up using reasonable amounts of data (if they have to verify each epoch message themselves). Not sure where the cutoff point will be, but we can discuss this more later.
Good points here @psivesely , thanks for correcting and verifying things I've been guessing/approximating at: https://github.com/celo-org/celo-proposals/pull/41#issuecomment-667634302. I'll pull your comments out for cleanliness:
The verification key is quite short, especially because the actual instance that's verified is a hash of the initial and final epochs. The prover key is the one that's very long.
Good point, I was referring to a point Kobi had mentioned about not wanting to store these on chain and I wrongly assumed it was because they were long. Either way, we've been planning to store VKs within the client, so if they are indeed short that works great.
You don't need this to verify the proof. If the proof verifies it implies this number is < 1/3 of validators.
Interesting, I've just been referring to the BLS-go snark API which seems to require this info. Would it be possible to create a new `EpochBlock` struct without that field then? Not a big deal since I believe the number is constant barring gov changes, but curious why that field exists if this is the case, or if I'm misunderstanding.
Zero-knowledge proofs and interactive proofs always use randomness, but we're using a non-zk version of the non-interactive Groth16 proof system, which results in unique proofs given the same initial and final epoch (short of the existence of a fork that merges back at some later point in the epoch message chain).

Just wanted to correct something: the honest non-zk-variant groth16 prover will produce the same proof for an instance x, but anyone can randomize any valid proof for x to create another valid proof. E.g., if (A, C, B) in G1^2 x G2 then so is (rA, C, r^(-1)B) for any nonzero r in Fp.
This makes sense; we haven't been working on any assumption of determinism just in case of an "attack" like this - good to hear that those worries weren't unfounded. I don't think there would be any significant optimizations we could implement if we assumed proofs were deterministic anyway, so this works for me.
The client will already have in its state the validator keys for the initial epoch and needs the keys for the final epoch. Keys are one element in group 2 of BLS12-377, so they are 95B each. I think it just makes sense to send the 100 keys instead of the diff because a maximum saving of 9.5kB is not worth adding the logic of the diff.
I agree in principle here, but I'll take a look when I'm implementing this to see if this tradeoff is really worth it. I'll also take a look at val set churn, if the churn over ~120 epochs on average is << 100 then it may be worth the diff logic.
One thing missing from discussion here is batch verification. Batch verification refers to using a modified version of a verification algorithm to verify multiple SNARKs/ signatures/ etc. together faster than they could be independently verified using the standard verification algorithm. If we ensure the proofs we store line up (e.g., (0, 120), (120, 240), (240, 256)), then in our case there is another optimization we can use: following the example above, a client can receive 3 proofs pi_1,pi_2,pi_3 for the 3 time periods above, but they only need to receive the 9.5kB of final validator keys at epoch 256, and a hash-to-field (of size 2 field elements over BW6-761 = 96B) for both the validator keys at 120 and 240. So since they're small, the hash-to-field of the initial epoch of each proof should be stored as well so they don't need to be recomputed.
This is a great point - this was mentioned briefly but I've neglected to add it. Thanks for going deep on this; storing the hash-to-field value within each proof makes sense to me as well then, I'll add that to the spec.
You don't need this to verify the proof. If the proof verifies it implies this number is < 1/3 of validators.
Interesting, I've just been referring to the BLS-go snark API which seems to require this info. Would it be possible to create a new `EpochBlock` struct without that field then? Not a big deal since I believe the number is constant barring gov changes, but curious why that field exists if this is the case, or if I'm misunderstanding.
You do need it. I corrected myself here: https://github.com/celo-org/celo-proposals/pull/41#issuecomment-669291180.
Zero-knowledge proofs and interactive proofs always use randomness, but we're using a non-zk version of the non-interactive Groth16 proof system, which results in unique proofs given the same initial and final epoch (short of the existence of a fork that merges back at some later point in the epoch message chain).

Just wanted to correct something: the honest non-zk-variant groth16 prover will produce the same proof for an instance x, but anyone can randomize any valid proof for x to create another valid proof. E.g., if (A, C, B) in G1^2 x G2 then so is (rA, C, r^(-1)B) for any nonzero r in Fp.
This makes sense; we haven't been working on any assumption of determinism just in case of an "attack" like this - good to hear that those worries weren't unfounded. I don't think there would be any significant optimizations we could implement if we assumed proofs were deterministic anyway, so this works for me.
This only holds for proofs directly received from honest provers. See update on this: https://github.com/celo-org/celo-proposals/pull/41#issuecomment-668743912
The client will already have in its state the validator keys for the initial epoch and needs the keys for the final epoch. Keys are one element in group 2 of BLS12-377, so they are 95B each. I think it just makes sense to send the 100 keys instead of the diff because a maximum saving of 9.5kB is not worth adding the logic of the diff.
I agree in principle here, but I'll take a look when I'm implementing this to see if this tradeoff is really worth it. I'll also take a look at val set churn, if the churn over ~120 epochs on average is << 100 then it may be worth the diff logic.
I just remembered the client has to do a fair amount of work to decompress the public keys after receiving them in order to actually use them to check block signatures, and that the decompressed keys should be cached by the client (pretty sure this is implemented already). To avoid re-doing the decompression work you'd want to implement basically the same logic on the client side to make the diff. So actually now I'm in favor of sending just the diff. I can think of a diff format that for 120 keys is 120B + however many bytes the new public keys are. Lmk if you have a shorter one :D
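To make that concrete, here's one plausible instantiation of such a format (my guess, not necessarily the one meant above): one flag byte per validator slot gives the 120B, plus the compressed replacement keys.

```go
// Hypothetical diff encoding: one flag byte per validator slot (120B for
// 120 slots) followed by the replacement keys in slot order.
type valSetDiff struct {
	Changed []byte   // Changed[i] != 0 means slot i has a new key
	NewKeys [][]byte // compressed replacement keys for the changed slots, in order
}

// apply rebuilds the full key set from the cached previous set; only the
// changed slots need decompression afterwards.
func (d valSetDiff) apply(old [][]byte) [][]byte {
	out := make([][]byte, len(old))
	next := 0
	for i, key := range old {
		if d.Changed[i] != 0 {
			out[i] = d.NewKeys[next]
			next++
		} else {
			out[i] = key
		}
	}
	return out
}
```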
I feel some of my comments regarding my opinions on what proofs to store have been unclear, so to explain my current thinking a bit more clearly I propose storing the following set: {[120i, 120i+j] | i in {0, 1, 2, ...}, j in {1, 2, ..., 120}}. We want to support two operations:
- `bootstrap()`: a new ultralight client hardcoded with the genesis state has just joined the network - let's get it the latest committee public keys so it can start transacting. The server sends the client the minimum number of proofs from the genesis validator set to the current committee and a single validator diff. There may be a window between when an epoch begins and when the server receives a proof that includes the last epoch message, so in this case the server sends proofs/diff up to the penultimate epoch as well as the last epoch message and its multisig. As noted above, if a client goes offline for ~ a couple weeks it will be more data and CPU efficient for them to re-bootstrap instead of downloading and verifying the non-SNARK epoch messages for that time span. The exact crossover point will depend on how much time has passed since the first epoch, but we can just set some fixed value independent of time where it will be cheaper for at least the next 5 years to re-bootstrap, because we'll be on some new protocol version doing different stuff by then anyway.
- `verify_txs([txid_1,...,txid_n])`: an ultralight client just (re-)bootstrapped and would like to recover its transaction history - assuming it already got a list of txs to/from its address from some index-searching service, let's provide it with proofs that all those transactions are valid. Given a vector of transaction identifiers, the server sends the minimal proof chain leading up to each one, minus the proofs that will already be sent because they are required for previous epochs. The server then sends the relevant block headers, their multisignatures (and accompanying bitmaps), and the Merkle tree proof of tx inclusion (if multiple transactions are queried in the same block, the server can reduce communication requirements by not sending the same sibling nodes twice, nor nodes the client will be able to compute and cache given previous inclusion proofs it receives). The client batch verifies both the SNARKs and the multisignatures.
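Given that stored set, selecting the minimal proof chain for `bootstrap()` is straightforward (sketch):

```go
// bootstrapSpans returns the minimal chain of proof spans from genesis to
// epoch t, drawn from the stored set {[120i, 120i+j]}.
// bootstrapSpans(256) => [(0,120), (120,240), (240,256)].
func bootstrapSpans(t uint64) [][2]uint64 {
	var spans [][2]uint64
	for start := uint64(0); start < t; start += 120 {
		end := start + 120
		if end > t {
			end = t // final, partial-period proof
		}
		spans = append(spans, [2]uint64{start, end})
	}
	return spans
}
```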
As noted above, if a client goes offline for ~ a couple weeks it will be more data and CPU efficient for them to re-bootstrap instead of downloading and verifying the non-SNARK epoch messages for that time span
I think there's an option you're not considering, which is to download and verify plumo proofs starting, not at the latest epoch for which the light client has the validator set, but from the last epoch for which the full node is able to prove up to the current epoch with plumo proofs only.
I propose storing the following set: {[120i, 120i+j] | i in {0, 1, 2, ...}, j in {1, 2, ..., 120}}
This is what I was thinking as well 👍
I think there's an option you're not considering, which is to download and verify plumo proofs starting, not at the latest epoch for which the light client has the validator set, but from the last epoch for which the full node is able to prove up to the current epoch with plumo proofs only.
Yeah, I was just oversimplifying, but to give a more optimized re-bootstrapping parameter: let t be the latest epoch for which a client has verified the validator public key set (using any combination of SNARKs and multisigs), let t + ∆ be the current epoch, let s = floor(t/120), and let s' = floor((t+∆)/120). Then when ∆ > 5/4(s'-s+1) + 10 the client should verify s'-s+1 SNARK proofs to re-bootstrap, and when ∆ ≤ 5/4(s'-s+1) + 10 the client should verify ∆ multisigs to re-bootstrap. This is based on verification time, and not on the amount of data that needs to be downloaded. I could try to provide bounds for that as well, but would need an estimate for expected validator churn.
The bounds above are based on the following assumptions/observations. Batch Groth16 requires n+3 pairings and n+2 G1 multiplications, and batch BLS requires n+1 pairings. We ignore the cost of the linear number of field operations done in batch Groth16, which are considerably cheaper than group operations and pairings. If we let a G1 multiplication be the unit cost and assume a pairing costs 4 units, then we arrive at the bounds above.
Benchmarks can of course produce more accurate bounds here, but these shouldn't be too far off.
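Expressed as a client-side decision rule (sketch, directly translating the inequality above into integer arithmetic):

```go
// rebootstrapWithSnarks implements the rule above: re-bootstrap via SNARKs
// when delta > 5/4*(s'-s+1) + 10, otherwise verify the delta multisigs.
// t is the latest verified epoch; t+delta is the current epoch.
func rebootstrapWithSnarks(t, delta uint64) bool {
	s := t / 120                // floor(t/120)
	sPrime := (t + delta) / 120 // floor((t+delta)/120)
	numProofs := sPrime - s + 1 // SNARK proofs needed to re-bootstrap
	// delta > 5/4*numProofs + 10, kept in integers:
	return 4*delta > 5*numProofs+40
}
```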
The bounds above can also be used to optimize for clients who wish to verify txs/block headers in multiple epochs that are close together. Instead of providing a proof up to each epoch as I suggest above, if any two are close enough in the sense of the bound above then you can just send the SNARKs needed to get to the first one and the diffs, multisigs, and bitmaps the client needs to compute and verify the public key set for the next one.