cbor2 Preferred Serialization and Canonical encoding in CBOR

The CBOR mailing list has been discussing the definition of Canonical in the standard and has been making changes. I wanted to document these and discuss how they might be implemented in cbor2.

Updated draft standard: https://datatracker.ietf.org/doc/draft-ietf-cbor-7049bis

What constitutes a true canonical encoding can be redefined at will by protocol implementers; the CBOR standard only provides guidelines.

A general encoder/decoder like cbor2 will need to support a number of variations and validate them.

Constraints which have been discussed on the IETF mailing list and in the updated draft:

Shortest possible float representations
Fixed float representations
Fixed integer representations
Maps sorted by bytewise lexicographic ordering of the encoded keys (DRAFT)
Maps sorted by ascending encoded key length, then lexicographically (RFC 7049)
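
To make the difference between the two map orderings concrete, here is a small sketch that sorts already-encoded keys both ways. It uses only the standard library, not cbor2; the function names are hypothetical and the byte strings are the encoded forms of the keys 10, 100, -1, "z", "aa", [100], [-1] and false.

```python
# Encoded CBOR keys, deliberately listed out of order.
encoded_keys = [
    bytes.fromhex("811864"),  # [100]
    bytes.fromhex("f4"),      # false
    bytes.fromhex("626161"),  # "aa"
    bytes.fromhex("20"),      # -1
    bytes.fromhex("1864"),    # 100
    bytes.fromhex("8120"),    # [-1]
    bytes.fromhex("617a"),    # "z"
    bytes.fromhex("0a"),      # 10
]


def sort_bytewise(keys):
    """Draft ordering: plain bytewise comparison of the encoded keys."""
    return sorted(keys)


def sort_length_first(keys):
    """RFC 7049 ordering: shorter encodings first, then bytewise."""
    return sorted(keys, key=lambda k: (len(k), k))


print([k.hex() for k in sort_bytewise(encoded_keys)])
print([k.hex() for k in sort_length_first(encoded_keys)])
```

The two results differ as soon as a short key has a larger leading byte than a longer one, for example false (f4) versus 100 (1864).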

There may be tagged arrays created for fixed length binary encodings of float values. (Tag values TBD)

See: https://datatracker.ietf.org/doc/draft-ietf-cbor-array-tags/

Decoders may need to validate these by raising errors if the following conditions are met:

Indefinite length types
Floating point values not in shortest form
Floating point values not in fixed representation
Integers not in shortest form
Integers not in fixed form
Unsorted maps
Maps sorted with the wrong algorithm
Maps with duplicate keys
Incorrect tag type
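
As an illustration of one of these checks, here is a rough sketch of how "integers not in shortest form" could be detected from the raw bytes. This is standalone code, not part of cbor2, and the function name is hypothetical.

```python
def int_head_is_shortest(data: bytes) -> bool:
    """Return True if the CBOR integer starting at data[0] uses the
    shortest possible encoding of its argument."""
    major_type, info = data[0] >> 5, data[0] & 0x1F
    if major_type not in (0, 1):
        raise ValueError("not an unsigned or negative integer")
    if info < 24:
        return True  # the value is stored in the initial byte itself
    if info > 27:
        raise ValueError("invalid additional information for an integer")
    size = 1 << (info - 24)  # 1, 2, 4 or 8 following bytes
    if len(data) < 1 + size:
        raise ValueError("truncated integer")
    argument = int.from_bytes(data[1:1 + size], "big")
    # Smallest argument that actually needs this many bytes.
    smallest = {24: 24, 25: 0x100, 26: 0x10000, 27: 0x100000000}[info]
    return argument >= smallest


print(int_head_is_shortest(bytes.fromhex("1864")))    # 100 in one extra byte
print(int_head_is_shortest(bytes.fromhex("190064")))  # 100 padded to two bytes
```

A validating decoder would raise an error where this returns False.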

Instead of a single canonical=True argument there need to be separate flags for each potential constraint.

For example, if a device expects only 16-bit floating point data you could create the encoder like this:

encoder = CBOREncoder(f, float_format="binary16")
encoder.encode(data)

Or for a minimal float encoding and maps sorted by encoded key length:

encoder = CBOREncoder(f, float_format="minimal", sort_maps=True, sort_by_length=True)
encoder.encode(data)
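
For reference, a "minimal" float encoding could work roughly like the sketch below: try half precision, then single, and fall back to double only if the value does not round-trip. This is standalone code using the struct module, not the cbor2 implementation, and the function name is hypothetical. Encoding NaN as f97e00 is an assumption about the preferred form.

```python
import math
import struct


def encode_float_minimal(value: float) -> bytes:
    """Encode a float in the shortest CBOR form that preserves its value."""
    if math.isnan(value):
        # NaN never compares equal to itself, so handle it up front.
        return b"\xf9\x7e\x00"
    for initial_byte, fmt in ((0xF9, ">e"), (0xFA, ">f")):
        try:
            packed = struct.pack(fmt, value)
        except OverflowError:
            continue  # too large for this width, try the next one
        if struct.unpack(fmt, packed)[0] == value:
            return bytes([initial_byte]) + packed
    return b"\xfb" + struct.pack(">d", value)


for sample in (1.5, 100000.0, 1.1):
    print(sample, encode_float_minimal(sample).hex())
```

A fixed representation such as "binary16" would skip the loop and always pack with one format, raising an error if the value does not fit.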

On the decoding side:

decoder = CBORDecoder(f, validate_floats_as="binary16")
result = decoder.decode()

decoder = CBORDecoder(f, validate_floats_as="minimal",
    validate_map_order=True,
    ordered_by_length=True,
    ignore_duplicate_keys=False)
result = decoder.decode()
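
To sketch what the map validation might do internally: if the decoder keeps the encoded bytes of each key in the order they appear on the wire, a single pass can check both ordering and duplicates. This is standalone code, not the cbor2 API; the function name is hypothetical and the ordered_by_length parameter just mirrors the example argument above.

```python
from typing import Sequence


def check_map_keys(encoded_keys: Sequence[bytes],
                   ordered_by_length: bool = False) -> None:
    """Raise ValueError unless the encoded keys are strictly ascending
    under the chosen ordering (which also rules out duplicates)."""
    def sort_key(key: bytes):
        # Length-first ordering compares lengths before contents.
        return (len(key), key) if ordered_by_length else (0, key)

    for previous, current in zip(encoded_keys, encoded_keys[1:]):
        if previous == current:
            raise ValueError(f"duplicate map key {current.hex()}")
        if sort_key(previous) > sort_key(current):
            raise ValueError(f"map key {current.hex()} is out of order")


# Keys 10, 100, -1 are valid in bytewise order...
check_map_keys([b"\x0a", b"\x18\x64", b"\x20"])
# ...but not in length-first order, where -1 must come before 100.
try:
    check_map_keys([b"\x0a", b"\x18\x64", b"\x20"], ordered_by_length=True)
except ValueError as exc:
    print(exc)
```

Because the check requires strictly ascending keys, a duplicate that is not adjacent to its twin still fails, just with the "out of order" message.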

Of course these argument names and the way they are set up are just intended as an example.

Oct 23 '18 11:10 Sekenre

At this point I think you should become the maintainer of cbor2. How about it?

Oct 24 '18 06:10 agronholm

I'd be honoured! I am inexperienced but very keen :grinning:

Oct 24 '18 11:10 Sekenre

The long integer code seems to have O(n^2) time complexity? Let's say I have a 10,000-bit integer.

Nov 02 '18 09:11 fsssosei

What code where?

Nov 02 '18 10:11 agronholm

The long integer code seems to have O(n^2) time complexity? Let's say I have a 10,000-bit integer.

Hi @fsssosei, could you open a new issue for this? I will look at it when I get the chance.

Nov 05 '18 09:11 Sekenre

To add some food for thought: Here is an article discussing the different "canonical" encodings in CBOR: https://www.imperialviolet.org/2022/04/17/canonsofcbor.html It also proposes a naming scheme. RFC 7049 seems to describe "three-step" ordering (but could be read ambiguously), while RFC 8949 describes "one-step" ordering.

Best as I can tell, cbor2 currently implements three-step ordering. For starters, the documentation could point out the different ways a "canonical" CBOR can be canonical, and document the current state in the library.

May 18 '22 10:05 henryk

Absolutely. Thanks @henryk for bringing that article to my attention. I have been playing around with splitting the canonical settings into their own options, i.e. the 3 options for map ordering, fixed or variable-sized floating point, etc., and then keeping a backwards-compatible default.

Of course it's easy in Python, harder in C. I don't think I will implement validating whether something is canonical, but I will document how someone could do so.

May 23 '22 12:05 Sekenre