celestia-core
Investigate and decide on a different length delimiter for messages
See primer for message and share layout
We want to be able to easily determine the difference between some number of messages in a single namespace. This means that we will likely have to have some encoding that is enforced by validators.
Currently, the encoding consists of a variable-sized length delimiter added to messages only in the first share that they occupy. This is described in the specs here. In this case, the length delimiter is produced using the PutUvarint function in the Go standard library. This is problematic, as it complicates the decoding of this data, particularly in environments such as the EVM or zk applications.
To fix this, we will likely still want validators to prepend a delimiter to the message or shares, but will do so in an easily standardized way.
Suggested mechanisms:
- Use an encoding scheme similar to how we encode contiguous shares, where we include a constant sized length delimiter in each share.
- Use some constant number of bytes, but only prepend them to the first share
Since messages are not contiguous (any given share can only contain a single message), the length delimiter doesn't actually have to encode the byte length of the message; instead, it could encode the number of shares used for a given message.
This issue is a summary of a synchronous discussion between @musalbas, @adlerjohn, and @tzdybal. I might have missed something; please add any missing details below.
This is problematic, as it complicates the decoding of this data, particularly in environments such as the EVM or zk applications.
Is this really a problem @musalbas @adlerjohn? Varint encoding is not complicated or complex, especially not for unsigned numbers. Of course fixed-size encoding is much simpler, but it limits the max length or might waste yet another byte or two in most cases. This adds up for sure. Not saying we should not change this, but it is a tradeoff to consider.
Regarding the suggested mechanisms:
Use an encoding scheme similar to how we encode contiguous shares, where we include a constant sized length delimiter in each share.
So this length delimiter would just say how many shares are still to come, right? And the shares are fixed size as well (currently 256 bytes, soon 512 bytes). I think the constant can be small here because of that: 2 bytes would be sufficient, as this means we can encode up to 65k shares. As Evan mentioned above, we already lay out "contiguous shares" like this. Do we still need to introduce another prefix? Or is this more about applying this to all shares?
Use some constant number of bytes, but only prepend them to the first share
That also works. Again, 2 bytes should be sufficient. I think we are currently doing something very similar, but encoding the message length in bytes instead of the number of shares (varint encoded). So I think the suggestion is to change this to fixed-size encoding and to encode the number of shares to follow. That is the simpler of the two approaches, as the code is currently closer to this anyway.
P.S: I do not fully understand this sentence:
We want to be able to easily determine the difference between some number of messages in a single namespace.
Note that this is a breaking change, and however we decide, we should do it ASAP. In particular, if we deploy this on some public testnet it will require a re-genesis, which could be inconvenient for app devs.
Can we get agreement that we encode a fixed-size (2 byte) prefix at the beginning of a message (the data a PfD pays for) and be done with it? I think it is fine that we potentially waste a byte, as long as it is metered as gas accordingly and potentially paid for.
Can we document the decision here in this repo, as well as a light encoding spec for app devs and clients?
The spec served as a previous guideline of implementing this and the implementation is aligned with this as Evan mentioned above: https://celestiaorg.github.io/celestia-specs/latest/specs/data_structures.html#arranging-available-data-into-shares and https://celestiaorg.github.io/celestia-specs/latest/specs/data_structures.html#serialization
Instead of updating the spec, let's have an ADR describing that simple change and decision and a spec regarding the actual serialization (here or in app). I think a lot can be copied from the archived specs but it should be very concrete and easy to understand.
In my opinion, the point of having a length prefix is the ability to store messages contiguously in one or more shares (from multiple small messages in a single share to multiple big messages each spanning multiple shares), and being able to parse those messages back out of the shares without querying for the original PFD.
I can't see how we can have the above properties with the length prefixed in each share.
I also opt for expressing length in bytes (not shares), because:
- some serialization formats (including protobuf) expect input of exact size (without extra padding).
- this enables storing multiple messages per share
We may want to consider versioning the shares in order to allow for backwards-compatibility for on-chain smart contracts or programs that need to parse the shares but are not upgradeable.
I've opened a separate issue for the above so we can discuss the length delimiter here only.
@tzdybal brought up some good points above. If we use bytes not number of shares, then 2 bytes might not be enough IMO.
If we use bytes, I would still vote for varint encoding!
OK, after thinking about this further, I do think varint encoding is not so bad after all: it will always work (up to 10 bytes for an unsigned integer ... ), it is well established, efficient, has wide language support and is simple enough to be implemented in any other new language.
I do not buy the idea that this is too difficult in a zk-snark context (if a simple for loop and a few byte ops are too much ... what useful thing can you do in that snark anyway 🤷🏼).
A solidity implementation is already here and I'm sure that is not the only one.
Also is parsing these shares sth we expect to happen in a zk-snark (or evm) context frequently anyways? If so, why?
One downside of counting shares is that the number of shares occupied by a message could change depending on how many bytes the varint-encoded length delimiter takes, whereas the number of bytes of the data itself does not change regardless of how the length delimiter is encoded.
Varint encoding isn't so bad in ZK: estimated on the order of a few hundred times more expensive than an addition. But an addition is really cheap, and this varint decoding would only have to be done once per data. In the optimistic case, it would only have to be done for fraud proofs.
I'm not convinced yet of an attack vector that requires length delimiter per share instead of per data.
What do you think about encoding both the sharelen amount and datalen, but as different encoding layers?
The main reason to encode shareslen is to avoid prepending namespaces to each share in the EDS. Currently, we prepend namespaces to each share of the EDS so that the namespaces can be reconstructed. Instead, we could keep the namespace only in the first "meta" share, plus the number of shares after it that belong to the same namespace. Essentially, this allows us to fit more (nIDSize*sharelen - sharelen) application data into the block.
So something like version+namespace+sharelen+appdata would be the "meta" share, keeping metadata of the first layer of encoding injected into the celestia-node protocol stack, and datalen (or anything else) can be an application-defined (meta)data format as a layer of encoding on top.
API-wise, with this approach, shares are hidden away from the user in favor of data streams per namespace per block: a concatenation of all the shares per namespace, minus the metadata in the first meta share, until the first encounter of the EONamespace symbol. This would also allow us to have msgs/streams less than one share in size (cc @tzdybal). Further, those streams per block can be concatenated into one higher-order stream per namespace. E.g., you receive one 'io.Reader' by namespace from which you can read data coming from contiguous blocks as a single data stream. Innit dope?
Cc @adlerjohn, @musalbas, @liamsi
How would you determine if a share is the first share of a namespace, under that proposal?
It is not a problem to find the "meta" share if it's within the same axis, but if it's outside, rows and cols start to depend on each other for verification. For example, column 3 needs to be verified, but the "meta" share with a namespace for the share (Col: 3, Row: 4) is located at row 3, so tree reconstruction for col 3 is dependent on row 3. If this is a problem, then the proposal is disqualified, particularly the part about not keeping namespaces in each share. Although, I don't understand why the dependency could be a problem besides implementation complexity. BEFPs can still be generated; we just won't be able to verify an axis instantly.
Well, cycle dependencies might be an issue, need to think more about this.
Removing namespace IDs from leaves would have implications for the ability to verify Merkle proofs in the NMT, and thus erasure-coding fraud proofs, that would have to be analysed.
Closing in favor of https://github.com/celestiaorg/celestia-core/issues/839 and https://github.com/celestiaorg/celestia-app/issues/659? Please re-open if this needs further attention.