celestia-specs
Define Data Availability Sampling (DAS)
Define the process and parameters (number of samples) for doing Data Availability Sampling (DAS).
I'm not sure the number of samples has to be part of the spec, because clients can decide this for themselves; however, the spec can make secure recommendations.
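To illustrate what such a recommendation could look like, here is a rough sketch of the standard DAS soundness calculation. It assumes (as in the paper) a k × k ODS extended to a 2k × 2k EDS, so an adversary must withhold at least (k+1)² shares to prevent reconstruction; the function name and the choice k = 64 are illustrative, not part of any spec.

```python
from fractions import Fraction

def miss_probability(k: int, s: int) -> Fraction:
    """Probability that s distinct uniform samples from a 2k x 2k EDS
    all miss a minimal unavailable set of (k + 1)^2 shares."""
    n = (2 * k) ** 2   # total shares in the extended square
    c = (k + 1) ** 2   # minimum shares an adversary must withhold
    p = Fraction(1)
    for i in range(s):
        p *= Fraction(n - c - i, n - i)  # sampling without replacement
    return p

# Illustrative: for k = 64, 16 samples already push the detection
# probability (per client) above 99%.
print(float(1 - miss_probability(64, 16)))
```

Since the per-sample miss probability is roughly 1 - (k+1)²/(2k)² ≈ 3/4, the detection probability approaches 1 exponentially in the number of samples, which is why a fairly small per-client sample count can be recommended.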
Just for my understanding: will this include how we use ipfs?
> will this include how we use ipfs?
Eventually, yes. But it will probably start with more conservative goals, e.g. just defining what to sample. Then move on to how to sample as that gets iterated on in the implementation.
A recent topic that came up is "do we need to sample from both rows and columns for DAS?" (cf. https://github.com/lazyledger/lazyledger-core/pull/270)
Mechanism 1 (from paper)
The mechanism from the paper (https://github.com/lazyledger/lazyledger-core/pull/270#discussion_r610034373) involves sampling shares randomly from both rows and columns. In other words, each share in the EDS (extended data square) has two possible addresses: (row, col) (i.e. a column index against a row root) and (col, row) (i.e. a row index against a column root).
A fraud proof consists of Merkle proofs against a single row/column root for at least k (the width of the ODS (original data square)) shares.
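A toy sketch of the dual addressing described above, to make the "two addresses per share" point concrete. The function names and the address tuple layout are hypothetical, not from the spec:

```python
# Hypothetical sketch of mechanism 1's dual addressing: every share in the
# EDS can be named either by (row root, column index) or by
# (column root, row index).
def addresses(row: int, col: int):
    """Both addresses of the share at (row, col) in the EDS."""
    return [("row", row, col),   # column index under the row root
            ("col", col, row)]   # row index under the column root

def resolve(eds, axis, major, minor):
    """Fetch a share given either kind of address."""
    return eds[minor][major] if axis == "col" else eds[major][minor]

eds = [[(r, c) for c in range(4)] for r in range(4)]  # toy 4x4 EDS
a1, a2 = addresses(1, 2)
assert resolve(eds, *a1) == resolve(eds, *a2) == (1, 2)
```

Both addresses resolve to the same underlying share, which is the duplication mechanism 2 below argues is unnecessary.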
Mechanism 2 (potential alternative)
We note that having two addresses for shares introduces duplication without any additional security. Instead, we can sample only from rows. The space of shares is identical to the above, so each individual share has exactly the same probability of being sampled. The only difference is that only Merkle proofs against row roots are downloaded.
A fraud proof consists of Merkle proofs against a single row root (for a fraudulent row), or Merkle proofs against multiple row roots (for a fraudulent column), for at least k shares.
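The equal-probability claim above can be checked by a small enumeration: mechanism 1 samples uniformly over all 2·(2k)² addresses, which map two-to-one onto shares, while mechanism 2 samples uniformly over the (2k)² (row, col) pairs directly. Both induce the same uniform distribution over shares. This is only an illustrative sketch; the function names are made up:

```python
from collections import Counter

def mech1_share_distribution(k: int) -> Counter:
    """Mechanism 1: uniform over all 2 * (2k)^2 addresses.
    Each share has two addresses, so each share gets weight 2."""
    n = 2 * k
    c = Counter()
    for r in range(n):
        for col in range(n):
            c[(r, col)] += 1  # address (row, col) under a row root
            c[(r, col)] += 1  # address (col, row) under a column root
    return c

def mech2_share_distribution(k: int) -> Counter:
    """Mechanism 2: uniform over (row, col) pairs only."""
    n = 2 * k
    return Counter((r, col) for r in range(n) for col in range(n))

# Normalising both counters yields the same uniform distribution.
d1, d2 = mech1_share_distribution(2), mech2_share_distribution(2)
assert {s: w / sum(d1.values()) for s, w in d1.items()} == \
       {s: w / sum(d2.values()) for s, w in d2.items()}
```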
Discussion
Pros of alternative:
- If all nodes by default configuration only sample from rows, there's twice as much bandwidth/capacity available for the same shares (conversely, sampling from both rows and columns halves the bandwidth/capacity available).
Cons of alternative:
- Fraud proofs for columns are slightly larger.
I don't immediately see any security issues with the alternative, but maybe I'm missing something.
> Fraud proofs for columns are slightly larger.
The question is how much larger in the worst case, how often we expect this to happen, and who is impacted (i.e. who needs to download and process these fraud proofs). As far as I understand, only (some particular kind of) full nodes will be generating these fraud proofs, and it is expected to be a really, really rare event. What kind of clients will be validating these exactly?
> The question is how much larger in the worst-case
Without deduplication of sibling nodes, Merkle proofs against a single root vs. Merkle proofs against k roots are basically the same size. Nodes are expected to have all the row and column roots, so those don't need to be included in the proof explicitly.
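A back-of-the-envelope sketch of the size comparison, assuming binary Merkle trees over 2k leaves per row and a 32-byte hash; the constants and function name are illustrative:

```python
import math

HASH_SIZE = 32  # bytes, e.g. a SHA-256 digest; illustrative

def proof_bytes(num_shares: int, tree_width: int) -> int:
    """Total size of num_shares independent Merkle proofs (no sibling
    deduplication) in trees of tree_width leaves."""
    depth = math.ceil(math.log2(tree_width))
    return num_shares * depth * HASH_SIZE

k = 128  # illustrative ODS width
row_case = proof_bytes(k, 2 * k)  # k proofs against one row root
col_case = proof_bytes(k, 2 * k)  # k proofs, one per row root
assert row_case == col_case       # identical without deduplication
```

With sibling deduplication, the k proofs against a single root share many internal nodes and compress well, while the k proofs against k distinct roots do not, which is presumably why column fraud proofs end up "slightly larger" in practice.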
> who is impacted by this/who needs to download and process these fraud proofs
> What kind of clients will be validating these exactly?
This would affect light nodes mostly in terms of processing fraud proofs.