cardano-ledger

Arbitrary instances with alternate, valid, serializations.

Open JaredCorduan opened this issue 2 years ago • 14 comments

The ledger is very careful never to re-serialize any data structure that will be hashed. This is important because many data structures have multiple valid CBOR encodings. Developers who reuse some of the ledger code, however, are not always aware of how important this is. In particular, projects sometimes mix the ledger code with third-party serialization libraries, leading to confusion about why hashes do not match. See #2943 for an example.
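As a minimal illustration of why hashes stop matching, the sketch below hashes two valid CBOR encodings of the same list `[1, 2, 3]`. The byte strings are hand-written for this example, and BLAKE2b-256 is assumed as the hash function; neither is taken from the ledger code.

```python
import hashlib

# Two valid CBOR encodings of the same list [1, 2, 3] (hand-written bytes).
definite = bytes.fromhex("83010203")      # 0x83: array of definite length 3
indefinite = bytes.fromhex("9f010203ff")  # 0x9f: indefinite-length array, 0xff: break


def blake2b_256(data: bytes) -> str:
    """Hash raw bytes with BLAKE2b-256 (assumed hash function for this sketch)."""
    return hashlib.blake2b(data, digest_size=32).hexdigest()


hash_definite = blake2b_256(definite)
hash_indefinite = blake2b_256(indefinite)

# Same logical value, different bytes, therefore different hashes.
hashes_match = hash_definite == hash_indefinite
print(hashes_match)
```

A library that decodes the first form and writes back the second would therefore change the hash without changing the value.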

Section 3.9 of the CBOR RFC (RFC 7049) describes the various choices allowed in CBOR, in the context of describing which choice to make if you want a canonical encoding.

The ledger itself, however, has to make specific choices about how to serialize, even though the deserializers are flexible.

To make it more transparent that some of the ledger's serialization choices are arbitrary (definite vs indefinite lists, etc.), we can write our Arbitrary generators so that they vary these choices.

The structure that seems to cause the most confusion is Datum. We could address Datums directly, or perhaps write a general way of "twiddling" all of our CBOR encodings.
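To make the "twiddling" idea concrete, here is a rough sketch of an encoder that picks a valid variant at random each time. The function name `twiddle` and the restriction to small integers, lists and dicts are assumptions made for this illustration; this is not the ledger's generator code.

```python
import random


def twiddle(value, rng: random.Random) -> bytes:
    """Encode small uints, lists and dicts, choosing among valid CBOR variants."""
    if isinstance(value, int):
        if not 0 <= value <= 23:
            raise ValueError("this sketch only handles integers 0..23")
        return bytes([value])                           # one-byte unsigned integer
    if isinstance(value, list):
        body = b"".join(twiddle(v, rng) for v in value)
        if len(value) <= 23 and rng.random() < 0.5:
            return bytes([0x80 | len(value)]) + body    # definite-length array
        return b"\x9f" + body + b"\xff"                 # indefinite-length array
    if isinstance(value, dict):
        items = list(value.items())
        rng.shuffle(items)                              # vary the key order
        body = b"".join(twiddle(k, rng) + twiddle(v, rng) for k, v in items)
        if len(items) <= 23 and rng.random() < 0.5:
            return bytes([0xA0 | len(items)]) + body    # definite-length map
        return b"\xbf" + body + b"\xff"                 # indefinite-length map
    raise TypeError(f"unsupported type: {type(value).__name__}")


# Collect the distinct encodings produced for one value across several seeds.
variants = {twiddle([1, 2, 3], random.Random(seed)) for seed in range(50)}
print(sorted(v.hex() for v in variants))
```

A property test built on such a generator would exercise every deserializer with encodings the serializer itself never produces.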

JaredCorduan avatar Aug 01 '22 20:08 JaredCorduan

If it's helpful, here are the three big encoding variances we've encountered and had to work around when interacting with browser and hardware wallets:

  • definite / indefinite length arrays
  • the ordering of map keys
  • empty list vs omitted key in maps
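The first two items can be seen with a toy decoder. The sketch below handles only small unsigned integers, arrays and maps, and is written from scratch for this illustration rather than taken from any wallet or ledger library. The third item concerns which keys are present rather than how CBOR encodes a value, so it is left out here.

```python
def decode(data: bytes, pos: int = 0):
    """Decode one CBOR item (small uints, arrays, maps); return (value, next_pos)."""
    initial = data[pos]
    major, info = initial >> 5, initial & 0x1F
    pos += 1
    if major == 0:                          # unsigned integer
        if info < 24:
            return info, pos
        if info == 24:
            return data[pos], pos + 1
        raise ValueError("this sketch only handles uints below 256")
    if major in (4, 5):                     # array (4) or map (5)
        items = []
        if info == 31:                      # indefinite length: read until break
            while data[pos] != 0xFF:
                item, pos = decode(data, pos)
                items.append(item)
            pos += 1                        # skip the break byte
        elif info < 24:
            count = info * (2 if major == 5 else 1)  # a map holds key/value pairs
            for _ in range(count):
                item, pos = decode(data, pos)
                items.append(item)
        else:
            raise ValueError("this sketch only handles short lengths")
        if major == 4:
            return items, pos
        return dict(zip(items[0::2], items[1::2])), pos
    raise ValueError(f"unsupported major type {major}")


# Definite vs indefinite length arrays decode to the same list.
same_list = decode(bytes.fromhex("83010203"))[0] == decode(bytes.fromhex("9f010203ff"))[0]
# Different key orderings decode to the same map.
same_map = decode(bytes.fromhex("a201020304"))[0] == decode(bytes.fromhex("a203040102"))[0]
print(same_list, same_map)
```

In both cases the decoded values compare equal even though the input bytes differ, which is why a decode-then-encode round trip cannot be relied on to reproduce the original bytes.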

Quantumplation avatar Aug 01 '22 20:08 Quantumplation

> The ledger itself, however, has to make specific choices about how to serialize, even though the deserializers are flexible.

Isn't it also very important that we define the canonical, deterministic serialization choices we made, so that the hashes stay reproducible by third parties?

yihuang avatar Aug 02 '22 03:08 yihuang

@yihuang no, it's important that people not be misled into thinking that there is a canonical representation. There is not a canonical representation. When you sign or hash the data, you must only hash the original bytes, and not a re-serialisation.

It's a very bad security practice to check signatures or hashes on re-serialised data. It must only be done on the original bytes. Otherwise it leads to all sorts of nasty security problems (think of txs with different bytes that nonetheless have the same hash).
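One way to picture this is a decoded value that carries the exact bytes it arrived as. The sketch below is a hypothetical illustration: the `Received` type, the helper names, and the use of BLAKE2b-256 are assumptions, not the ledger's implementation. It contrasts hashing the original bytes with hashing a re-serialisation.

```python
import hashlib
from dataclasses import dataclass


def blake2b_256(data: bytes) -> str:
    """Hash raw bytes with BLAKE2b-256 (assumed hash function for this sketch)."""
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def reencode(value: list) -> bytes:
    """Re-serialise a short list of small uints (0..23) as a definite-length array."""
    return bytes([0x80 | len(value)]) + bytes(value)


@dataclass(frozen=True)
class Received:
    """Hypothetical wrapper: a decoded value plus the exact bytes it arrived as."""
    value: list
    original_bytes: bytes

    def safe_hash(self) -> str:
        # Hash the bytes as received, never a fresh encoding of `value`.
        return blake2b_256(self.original_bytes)

    def unsafe_hash(self) -> str:
        # Anti-pattern: hash a re-serialisation of the decoded value.
        return blake2b_256(reencode(self.value))


a = Received([1, 2, 3], bytes.fromhex("83010203"))    # definite-length encoding
b = Received([1, 2, 3], bytes.fromhex("9f010203ff"))  # indefinite-length encoding

distinct_when_safe = a.safe_hash() != b.safe_hash()
collide_when_unsafe = a.unsafe_hash() == b.unsafe_hash()
print(distinct_when_safe, collide_when_unsafe)  # prints: True True
```

With the unsafe variant, two different byte strings on the wire end up with the same hash, which is the kind of problem described above.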

dcoutts avatar Aug 03 '22 13:08 dcoutts

> @yihuang no, it's important that people not be misled into thinking that there is a canonical representation. There is not a canonical representation. When you sign or hash the data, you must only hash the original bytes, and not a re-serialisation.
>
> It's a very bad security practice to check signatures or hashes on re-serialised data. It must only be done on the original bytes. Otherwise it leads to all sorts of nasty security problems (think of txs with different bytes that nonetheless have the same hash).

But shouldn't there be a specification or something? Or do you mean the only specification is the Haskell code itself? What about clients implemented in other languages?

yihuang avatar Aug 03 '22 14:08 yihuang

@yihuang we have a wire specification (CDDL) for every ledger era. See the table at the top of the readme in this repository. For example, the latest one is here: https://github.com/input-output-hk/cardano-ledger/blob/11e4d4a8ac88adf33baf6b0602635bf37a53803e/eras/babbage/test-suite/cddl-files/babbage.cddl

JaredCorduan avatar Aug 03 '22 16:08 JaredCorduan

> @yihuang we have a wire specification (CDDL) for every ledger era. See the table at the top of the readme in this repository. For example, the latest one is here: https://github.com/input-output-hk/cardano-ledger/blob/11e4d4a8ac88adf33baf6b0602635bf37a53803e/eras/babbage/test-suite/cddl-files/babbage.cddl

I mean a spec of the serialization details, so that third parties can reproduce the same result. Is there a complete doc on it that I'm not aware of?

yihuang avatar Aug 03 '22 17:08 yihuang

No, there is no such document. There is also no good reason for anyone to try to reproduce the exact arbitrary choices that our code makes and that are not captured by the CDDL spec. If they are trying to, they are likely doing exactly what we are trying to prevent. See https://github.com/input-output-hk/cardano-ledger/issues/2943#issuecomment-1203989504

JaredCorduan avatar Aug 03 '22 18:08 JaredCorduan

> No, there is no such document. There is also no good reason for anyone to try to reproduce the exact arbitrary choices that our code makes and that are not captured by the CDDL spec. If they are trying to, they are likely doing exactly what we are trying to prevent. See https://github.com/input-output-hk/cardano-ledger/issues/2943#issuecomment-1203989504

I can think of at least one case though: implementing Cardano in different languages. Or do you mean that even if an alternative client doesn't serialize a block in exactly the same way, it still works, because the other nodes won't try to re-serialize it? Hmm, if that's the case, that would make sense.

yihuang avatar Aug 04 '22 00:08 yihuang

> I can think of at least one case though: implementing Cardano in different languages. Or do you mean that even if an alternative client doesn't serialize a block in exactly the same way, it still works, because the other nodes won't try to re-serialize it?

Exactly! If everyone conforms to the CDDL spec and does not re-serialize, then everything works.

JaredCorduan avatar Aug 04 '22 01:08 JaredCorduan

Exactly. The spec is the CDDL, and there is no need for other interoperable implementations to serialise in exactly the same way so long as they follow the CDDL specification.

dcoutts avatar Aug 04 '22 11:08 dcoutts

@Quantumplation, what do you mean by the following?

> empty list vs omitted key in maps

Soupstraw avatar Aug 10 '22 13:08 Soupstraw

@Soupstraw Consider collateral, for example. If no collateral is specified, the cardano-api code will serialize the transaction body as a map, with key 13 set to an empty list:

84            ; Array of 4 elements
  a8          ; Map with 8 keys
    ...
    0d 9f ff  ; key 13, then an empty indefinite-length array (9f = start, ff = break)

whereas it is also a valid encoding (and the one preferred by hardware wallets, it appears) to just leave key 13 out of the map entirely:

84            ; Array of 4 elements
  a7          ; Map with 7 keys
    ...

The same thing applies to a few other fields, IIRC

Quantumplation avatar Aug 10 '22 14:08 Quantumplation

Right, so what @Quantumplation is referring to is not an ambiguity in CBOR, but rather in our CDDL. We often have optional keys in maps, and there is no semantic difference between leaving such a key out and including it with an empty (mempty) value.
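A small sketch of that equivalence, using plain Python dicts as stand-ins for decoded transaction bodies. The accessor name `collateral_inputs` and the sample bodies are hypothetical; only key 13 comes from the example in this thread.

```python
COLLATERAL_KEY = 13  # key number taken from the collateral example in this thread


def collateral_inputs(tx_body: dict) -> list:
    """Read the optional collateral field, treating a missing key as empty."""
    return tx_body.get(COLLATERAL_KEY, [])


# Stand-ins for two decoded transaction bodies (hypothetical contents).
body_with_empty_list = {0: [], COLLATERAL_KEY: []}
body_with_key_omitted = {0: []}

# The two bodies differ as maps, so their encodings and hashes differ too,
# yet the optional field reads the same from both.
same_meaning = collateral_inputs(body_with_empty_list) == collateral_inputs(body_with_key_omitted)
same_map = body_with_empty_list == body_with_key_omitted
print(same_meaning, same_map)  # prints: True False
```

Comparing values through accessors like this one, rather than comparing bytes or raw maps, avoids treating the two forms as different transactions.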

JaredCorduan avatar Aug 10 '22 15:08 JaredCorduan

Ahh, I see, thanks for clarifying!

Soupstraw avatar Aug 11 '22 09:08 Soupstraw