cbor2 icon indicating copy to clipboard operation
cbor2 copied to clipboard

Preferred Serialization and Canonical encoding in CBOR

Open Sekenre opened this issue 7 years ago • 11 comments

The CBOR mailing list has been discussing the definition of Canonical in the standard and have been making changes. I wanted to document these and maybe discuss how they might be implemented in cbor2.

Updated draft standard: https://datatracker.ietf.org/doc/draft-ietf-cbor-7049bis

What constitutes a true canonical encoding can be redefined at will by protocol implementers, CBOR standard provides guidelines.

A general encoder/decoder like cbor2 will need to support a number of variations and validate them.

Constraints which have been discussed on the ietf mailing list and in the updated draft:

  • Shortest possible float representations
  • Fixed float representations
  • Fixed integer representations
  • Maps sorted by lexicographical ordering of encoded value (DRAFT)
  • Maps sorted by ascending length then lexicographic (RFC7049)

There may be tagged arrays created for fixed length binary encodings of float values. (Tag values TBD)

See: https://datatracker.ietf.org/doc/draft-ietf-cbor-array-tags/

Decoders may need to validate these by raising errors if the following conditions are met:

  • Indefinite length types
  • Floating point values not in shortest form
  • Floating point values not in fixed representation
  • Integers not in shortest form
  • Integers not in Fixed form
  • Unsorted maps
  • Maps sorted with the wrong algorithm
  • Maps with duplicate keys
  • Incorrect tag type

Instead of a single canonical=True argument there needs to be separate flags for each potential constraint.

For example, if a device expects only 16bit floating point data you could create the encoder like this:

encoder = CBOREncoder(f, float_format="binary16")
encoder.encode(data)

Or for a minimal float encoding and sorted maps using the encoded length

encoder = CBOREncoder(f, float_format="minimal", sort_maps=True, sort_by_length=True)
encoder.encode(data)

On the decoding side:

decoder = CBORDecoder(f, validate_floats_as="binary16")
result = decoder.decode()
decoder = CBORDecoder(f, validate_floats_as="minimal",
    validate_map_order=True,
    ordered_by_length=True,
    ignore_duplicate_keys=False)
result = decoder.decode()

Of course these argument names and the way they are set up are just intended as an example.

Sekenre avatar Oct 23 '18 11:10 Sekenre

At this point I think you should become the maintainer of cbor2. How about it?

agronholm avatar Oct 24 '18 06:10 agronholm

I'd be honoured! I am inexperienced but very keen :grinning:

Sekenre avatar Oct 24 '18 11:10 Sekenre

The code seems to be a long integer time complexity O (n ^ 2)? Let's say I have a 10,000 bit integer

fsssosei avatar Nov 02 '18 09:11 fsssosei

What code where?

agronholm avatar Nov 02 '18 10:11 agronholm

The code seems to be a long integer time complexity O (n ^ 2)? Let's say I have a 10,000 bit integer

Hi @fsssosei could you open a new issue for this and I will look at it when I get the chance?

Sekenre avatar Nov 05 '18 09:11 Sekenre

To add some food for thought: Here is an article discussing the different "canonical" encodings in CBOR: https://www.imperialviolet.org/2022/04/17/canonsofcbor.html It also proposes a naming scheme. RFC 7049 seems to describe "three-step" ordering (but could be read ambigously), RFC 8949 describes "one-step" ordering.

Best as I can tell, cbor2 currently implements three-step ordering. For starters, the documentation could point out the different ways a "canonical" CBOR can be canonical, and document the current state in the library.

henryk avatar May 18 '22 10:05 henryk

Absolutely. Thanks @henryk for bringing that article to my attention. I have been playing around with splitting the canonical settings into their own options, i.e. the 3 options for map ordering, fixed or variable-sized floating point. Etc. Then have a backwards compatible default.

Of course it's easy in python, harder in C. I don't think I will implement validating whether something is canonical, but will document how someone could do so.

Sekenre avatar May 23 '22 12:05 Sekenre