Preferred Serialization and Canonical encoding in CBOR
The CBOR mailing list has been discussing the definition of Canonical in the standard and have been making changes. I wanted to document these and maybe discuss how they might be implemented in cbor2.
Updated draft standard: https://datatracker.ietf.org/doc/draft-ietf-cbor-7049bis
What constitutes a true canonical encoding can be redefined at will by protocol implementers, CBOR standard provides guidelines.
A general encoder/decoder like cbor2 will need to support a number of variations and validate them.
Constraints which have been discussed on the ietf mailing list and in the updated draft:
- Shortest possible float representations
- Fixed float representations
- Fixed integer representations
- Maps sorted by lexicographical ordering of encoded value (DRAFT)
- Maps sorted by ascending length then lexicographic (RFC7049)
There may be tagged arrays created for fixed length binary encodings of float values. (Tag values TBD)
See: https://datatracker.ietf.org/doc/draft-ietf-cbor-array-tags/
Decoders may need to validate these by raising errors if the following conditions are met:
- Indefinite length types
- Floating point values not in shortest form
- Floating point values not in fixed representation
- Integers not in shortest form
- Integers not in Fixed form
- Unsorted maps
- Maps sorted with the wrong algorithm
- Maps with duplicate keys
- Incorrect tag type
Instead of a single canonical=True argument there needs to be separate flags for each potential constraint.
For example, if a device expects only 16bit floating point data you could create the encoder like this:
encoder = CBOREncoder(f, float_format="binary16")
encoder.encode(data)
Or for a minimal float encoding and sorted maps using the encoded length
encoder = CBOREncoder(f, float_format="minimal", sort_maps=True, sort_by_length=True)
encoder.encode(data)
On the decoding side:
decoder = CBORDecoder(f, validate_floats_as="binary16")
result = decoder.decode()
decoder = CBORDecoder(f, validate_floats_as="minimal",
validate_map_order=True,
ordered_by_length=True,
ignore_duplicate_keys=False)
result = decoder.decode()
Of course these argument names and the way they are set up are just intended as an example.
At this point I think you should become the maintainer of cbor2. How about it?
I'd be honoured! I am inexperienced but very keen :grinning:
The code seems to be a long integer time complexity O (n ^ 2)? Let's say I have a 10,000 bit integer
What code where?
The code seems to be a long integer time complexity O (n ^ 2)? Let's say I have a 10,000 bit integer
Hi @fsssosei could you open a new issue for this and I will look at it when I get the chance?
To add some food for thought: Here is an article discussing the different "canonical" encodings in CBOR: https://www.imperialviolet.org/2022/04/17/canonsofcbor.html It also proposes a naming scheme. RFC 7049 seems to describe "three-step" ordering (but could be read ambigously), RFC 8949 describes "one-step" ordering.
Best as I can tell, cbor2 currently implements three-step ordering. For starters, the documentation could point out the different ways a "canonical" CBOR can be canonical, and document the current state in the library.
Absolutely. Thanks @henryk for bringing that article to my attention. I have been playing around with splitting the canonical settings into their own options, i.e. the 3 options for map ordering, fixed or variable-sized floating point. Etc. Then have a backwards compatible default.
Of course it's easy in python, harder in C. I don't think I will implement validating whether something is canonical, but will document how someone could do so.