Structured CBOR
fixes #2975
This PR introduces structured CBOR encoding and decoding.
Encoding from/to CborElement
Bytes can be decoded into an instance of CborElement with the [Cbor.decodeFromByteArray] function by either manually
specifying [CborElement.serializer()] or specifying [CborElement] as generic type parameter.
It is also possible to encode arbitrary serializable structures to a CborElement through [Cbor.encodeToCborElement].
Since these operations use the same code paths as regular serialization (but with specialized serializers), the config flags behave as expected.
Newly introduced CBOR-specific structures
- [CborPrimitive] represents primitive CBOR elements, such as string, integer, float, boolean, and null. CBOR byte strings are also treated as primitives.
  Each primitive has a [value][CborPrimitive.value]. Depending on the concrete type of the primitive, it maps to corresponding Kotlin types such as `String`, `Int`, `Double`, etc. Note that CBOR discriminates between positive ("unsigned") and negative ("signed") integers!
  `CborPrimitive` is itself an umbrella type (a sealed class) for the following concrete primitives:
  - [CborNull] mapping to a Kotlin `null`
  - [CborBoolean] mapping to a Kotlin `Boolean`
  - [CborInt] which is an umbrella type (a sealed class) itself for the following concrete types (it is still possible to instantiate it as the `invoke` operator on its companion is overridden accordingly):
    - [CborPositiveInt] represents all `Long` numbers `≥0`
    - [CborNegativeInt] represents all `Long` numbers `<0`
  - [CborString] maps to a Kotlin `String`
  - [CborFloat] maps to Kotlin `Double`
  - [CborByteString] maps to a Kotlin `ByteArray` and is used to encode them as CBOR byte string (in contrast to a list of individual bytes)
- [CborList] represents a CBOR array. It is a Kotlin [List] of `CborElement` items.
- [CborMap] represents a CBOR map/object. It is a Kotlin [Map] from `CborElement` keys to `CborElement` values. This is typically the result of serializing an arbitrary
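The "sealed umbrella type plus `invoke` on the companion" idea described for CborInt can be sketched with stand-in types. The names below (`SketchInt`, `SketchPositiveInt`, `SketchNegativeInt`, `describe`) are hypothetical and only mirror the description above. They are not the classes from this PR, and tags are left out for brevity.

```kotlin
// Hypothetical stand-ins that mirror the description; NOT the PR's actual classes.
sealed class SketchInt {
    abstract val value: Long

    companion object {
        // A sealed class cannot be constructed directly, so the companion's
        // invoke operator acts as a factory and picks the concrete subtype.
        operator fun invoke(value: Long): SketchInt =
            if (value >= 0) SketchPositiveInt(value) else SketchNegativeInt(value)
    }
}

data class SketchPositiveInt(override val value: Long) : SketchInt() {
    init { require(value >= 0) { "expected a value >= 0, got $value" } }
}

data class SketchNegativeInt(override val value: Long) : SketchInt() {
    init { require(value < 0) { "expected a value < 0, got $value" } }
}

// Reports which branch of the sealed hierarchy a value landed in.
fun describe(i: SketchInt): String = when (i) {
    is SketchPositiveInt -> "positive(${i.value})"
    is SketchNegativeInt -> "negative(${i.value})"
}
```

With these definitions, `SketchInt(5)` yields a `SketchPositiveInt` and `SketchInt(-1)` a `SketchNegativeInt`, so callers keep the sign distinction without having to pick the subtype themselves.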
Example
bf # map(*)
61 # text(1)
61 # "a"
cc # tag(12)
1a 0fffffff # unsigned(268,435,455)
d8 22 # base64 encoded text, tag(34)
61 # text(1)
62 # "b"
# invalid length at 0 for base64
20 # negative(-1)
d8 38 # tag(56)
61 # text(1)
63 # "c"
d8 4e # typed array of i32, little endian, twos-complement, tag(78)
42 # bytes(2)
cafe # "\xca\xfe"
# invalid data length for typed array
61 # text(1)
64 # "d"
d8 5a # tag(90)
cc # tag(12)
6b # text(11)
48656c6c6f20576f726c64 # "Hello World"
ff # break
Decoding it results in the following CborElement (shown in manually formatted diagnostic notation):
CborMap(tags=[], content={
CborString(tags=[], value=a) = CborPositiveInt( tags=[12], value=268435455),
CborString(tags=[34], value=b) = CborNegativeInt( tags=[], value=-1),
CborString(tags=[56], value=c) = CborByteString( tags=[78], value=h'cafe'),
CborString(tags=[], value=d) = CborString( tags=[90, 12], value=Hello World)
})
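To see where the tags and integer values in this example come from, here is a minimal sketch of reading CBOR item heads as laid out in RFC 8949: the top three bits of the initial byte are the major type, the low five bits are the additional information. The helpers `hex`, `Head`, `readHead` and `readTaggedInt` are hypothetical names made up for illustration. This is not the decoder from this PR, and it ignores indefinite lengths and arguments above `Long.MAX_VALUE`.

```kotlin
// Illustrative only: a tiny CBOR head reader, not the PR's decoder.
data class Head(val majorType: Int, val argument: Long, val length: Int)

// Turns a hex string such as "cc1a0fffffff" into bytes.
fun hex(s: String): ByteArray = s.chunked(2).map { it.toInt(16).toByte() }.toByteArray()

fun readHead(bytes: ByteArray, offset: Int = 0): Head {
    val initial = bytes[offset].toInt() and 0xFF
    val majorType = initial shr 5      // top 3 bits
    val info = initial and 0x1F        // low 5 bits
    // Number of bytes that follow the initial byte and carry the argument.
    val extra = when (info) {
        in 0..23 -> 0                  // argument is stored in the initial byte itself
        24 -> 1
        25 -> 2
        26 -> 4
        27 -> 8                        // values above Long.MAX_VALUE would wrap here
        else -> throw IllegalArgumentException("unsupported additional info: $info")
    }
    var argument = if (extra == 0) info.toLong() else 0L
    for (i in 1..extra) {
        argument = (argument shl 8) or (bytes[offset + i].toLong() and 0xFF)
    }
    return Head(majorType, argument, 1 + extra)
}

// Collects leading tags (major type 6), then reads one integer (major type 0 or 1).
fun readTaggedInt(bytes: ByteArray): Pair<List<Long>, Long> {
    val tags = mutableListOf<Long>()
    var pos = 0
    var head = readHead(bytes, pos)
    while (head.majorType == 6) {
        tags += head.argument
        pos += head.length
        head = readHead(bytes, pos)
    }
    return when (head.majorType) {
        0 -> tags to head.argument          // unsigned integer
        1 -> tags to (-1 - head.argument)   // negative integers encode -1 - n
        else -> throw IllegalArgumentException("not an integer: major type ${head.majorType}")
    }
}
```

For instance, `readTaggedInt(hex("cc1a0fffffff"))` gives tag 12 with the value 268435455, and `readTaggedInt(hex("20"))` gives no tags with the value -1, matching the first two map values in the example.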
Implementation Details
I tried to stick to the existing CBOR code paths as closely as possible. Adding tags directly to CborElements is the most pragmatic way of getting both expressiveness and convenient use. It does come with a caveat (also taken from the Readme):
Tags are properties of CborElements, and it is possible to mix arbitrary serializable values with CborElements that
contain tags inside a serializable structure. It is also possible to annotate any [CborElement] property
of a generic serializable class with @ValueTags.
This can lead to asymmetric behavior when serializing and deserializing such structures!
The test cases (and the comments in the test cases) reflect this.
Closing Remarks
I also fixed a faulty hex input test vector that I introduced myself last year, if I pieced it together correctly (see here), and I amended the benchmarks (see here).
Since the commits from here will be squashed anyways, I did not care for a clean history.
Full disclosure: This PR incorporates code from a draft generated by Junie (albeit an impressive draft that saved a day of work). This is not a dumb copypasta of AI-generated code. Even if it were already feature-complete, it would still not yet be marked ready for review because we have yet to review everything internally. I also want to stress that "we" is not a euphemism. There will be at least two of us reviewing and discussing internally, almost certainly with additional input from other humans in the process of readying this PR.
Performance seems to be OK (fromBytes and toBytes are the baseline on my machine):
| Metric / Benchmark | fromBytes | fromStruct | structFromBytes | toBytes | structToBytes | toStruct |
|---|---|---|---|---|---|---|
| Average (ops/ms) | 1205.615 ± 20.541 | 1545.814 ± 50.743 | 2896.728 ± 74.485 | 2089.013 ± 30.152 | 1442.766 ± 32.257 | 2581.397 ± 32.497 |
| Min | 1186.023 | 1458.225 | 2796.131 | 2066.499 | 1404.482 | 2550.026 |
| Max | 1229.778 | 1581.420 | 2960.572 | 2125.658 | 1475.015 | 2619.815 |
| Stdev | 13.586 | 33.563 | 49.267 | 19.944 | 21.336 | 21.495 |
| CI low (99.9 %) | 1185.075 | 1495.071 | 2822.244 | 2058.861 | 1410.509 | 2548.900 |
| CI high (99.9 %) | 1226.156 | 1596.557 | 2971.213 | 2119.165 | 1475.023 | 2613.893 |
My hot takes:
- Deserialising from a structure is fast enough since it is in the same ballpark as deserialising from bytes
- Deserialising into a generic CBOR structure takes twice as long as deserialising directly, which is fine, given that we instantiate much more, as even primitives need a containing class and an array of tags
- Serialising a generic CBOR structure to bytes is faster but in the same ballpark as generic to-byte serialisation of arbitrary serializable data
- Serializing to a CBOR structure is slower than to bytes, but OK enough, since it's in the same ballpark and we instantiate more
I just noticed something that looks weird to me. See this test case here that is failing and closely compare expected vs actual.
The byte string is wrapped twice for the reference. ~~I know there were some discussions, but I don't recall them, so I have to ask: why? Did I mess this up last year or is this intentional? Because the way I see it, we're wrapping a byte array instead of encoding it differently~~
EDIT: the test vector is faulty as this comparison fails the same way
Any updates on the open discussion points?
Thanks for all the comments! I'll have to dig up some memories that have since collected dust to sort some of the issues out and figure some stuff out again from scratch, as I haven't looked into this for many weeks and have forgotten most of the implementation details ;-). So it will take a bit before I push changes addressing the issues.