kotlinx.serialization

Structured cbor

Open JesusMcCloud opened this issue 5 months ago • 5 comments

fixes #2975

This PR introduces structured CBOR encoding and decoding.

Encoding from/to CborElement

Bytes can be decoded into an instance of CborElement with the [Cbor.decodeFromByteArray] function, either by manually specifying [CborElement.serializer()] or by specifying [CborElement] as the generic type parameter.
It is also possible to encode arbitrary serializable structures to a CborElement through [Cbor.encodeToCborElement].

Since these operations use the same code paths as regular serialization (but with specialized serializers), the config flags behave as expected.
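
A minimal sketch of the calls described above. The sample class Point is hypothetical. CborElement, CborElement.serializer() and Cbor.encodeToCborElement are taken from the description of this PR; their exact signatures and packages are assumptions until the PR is merged. The byte-array functions are the existing kotlinx.serialization API.

```kotlin
import kotlinx.serialization.*
import kotlinx.serialization.cbor.*

// Hypothetical sample type, used only for illustration.
@Serializable
data class Point(val x: Int, val y: Int)

// Decode by passing the serializer explicitly.
// Assumption: CborElement.serializer() exists as described in this PR.
@OptIn(ExperimentalSerializationApi::class)
fun decodeViaSerializer(bytes: ByteArray): CborElement =
    Cbor.decodeFromByteArray(CborElement.serializer(), bytes)

// Decode by naming CborElement as the generic type parameter.
@OptIn(ExperimentalSerializationApi::class)
fun decodeViaTypeParameter(bytes: ByteArray): CborElement =
    Cbor.decodeFromByteArray<CborElement>(bytes)

@OptIn(ExperimentalSerializationApi::class)
fun main() {
    // Existing API: serialize a class straight to CBOR bytes.
    val bytes = Cbor.encodeToByteArray(Point(1, 2))

    println(decodeViaSerializer(bytes))
    println(decodeViaTypeParameter(bytes))

    // Assumption: encodeToCborElement accepts any serializable value,
    // analogous to Json.encodeToJsonElement.
    val element: CborElement = Cbor.encodeToCborElement(Point(1, 2))
    println(element)
}
```

Both decoding variants go through the same Cbor instance, so a custom instance built with Cbor { ... } should apply its flags in the same way.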

Newly introduced CBOR-specific structures

  • [CborPrimitive] represents primitive CBOR elements, such as string, integer, float, boolean, and null. CBOR byte strings are also treated as primitives.
    Each primitive has a [value][CborPrimitive.value]. Depending on the concrete type of the primitive, it maps to corresponding Kotlin types such as String, Int, Double, etc. Note that CBOR discriminates between positive ("unsigned") and negative ("signed") integers!
    CborPrimitive is itself an umbrella type (a sealed class) for the following concrete primitives:

    • [CborNull] mapping to a Kotlin null
    • [CborBoolean] mapping to a Kotlin Boolean
    • [CborInt], which is itself an umbrella type (a sealed class) for the following concrete types (it can still be instantiated directly, because the invoke operator on its companion object is defined accordingly):
      • [CborPositiveInt] represents all Long numbers ≥0
      • [CborNegativeInt] represents all Long numbers <0
    • [CborString] maps to a Kotlin String
    • [CborFloat] maps to Kotlin Double
    • [CborByteString] maps to a Kotlin ByteArray and is used to encode them as CBOR byte string (in contrast to a list of individual bytes)
  • [CborList] represents a CBOR array. It is a Kotlin [List] of CborElement items.

  • [CborMap] represents a CBOR map/object. It is a Kotlin [Map] from CborElement keys to CborElement values. This is typically the result of serializing an arbitrary
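
The remark about instantiating the sealed CborInt through its companion can be illustrated with a small self-contained sketch. The classes below are hypothetical stand-ins with made-up names, not the classes from this PR; they only show the general Kotlin idiom of a companion invoke operator acting as a factory for a sealed type.

```kotlin
// Hypothetical stand-ins for illustration; these are NOT the classes from the PR.
sealed class SketchInt {
    abstract val value: Long

    companion object {
        // A sealed class is abstract, so it has no constructor callers can use.
        // Defining invoke on the companion turns SketchInt(x) into a factory call
        // that picks the matching concrete subtype.
        operator fun invoke(value: Long): SketchInt =
            if (value >= 0) SketchPositiveInt(value) else SketchNegativeInt(value)
    }
}

data class SketchPositiveInt(override val value: Long) : SketchInt() {
    init { require(value >= 0) { "SketchPositiveInt needs a value >= 0" } }
}

data class SketchNegativeInt(override val value: Long) : SketchInt() {
    init { require(value < 0) { "SketchNegativeInt needs a value < 0" } }
}

fun main() {
    println(SketchInt(268_435_455L)) // prints SketchPositiveInt(value=268435455)
    println(SketchInt(-1L))          // prints SketchNegativeInt(value=-1)
}
```

With this idiom, callers can construct an integer without deciding on the subtype up front, while a when over the sealed type stays exhaustive.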

Example

bf                                 # map(*)
   61                              #   text(1)
      61                           #     "a"
   cc                              #   tag(12)
      1a 0fffffff                  #     unsigned(268,435,455)
   d8 22                           #   base64 encoded text, tag(34)
      61                           #     text(1)
         62                        #       "b"
                                   #     invalid length at 0 for base64
   20                              #   negative(-1)
   d8 38                           #   tag(56)
      61                           #     text(1)
         63                        #       "c"
   d8 4e                           #   typed array of i32, little endian, twos-complement, tag(78)
      42                           #     bytes(2)
         cafe                      #       "\xca\xfe"
                                   #     invalid data length for typed array
   61                              #   text(1)
      64                           #     "d"
   d8 5a                           #   tag(90)
      cc                           #     tag(12)
         6b                        #       text(11)
            48656c6c6f20576f726c64 #         "Hello World"
   ff                              #   break

Decoding it results in the following CborElement (shown in manually formatted diagnostic notation):

CborMap(tags=[], content={  
    CborString(tags=[],   value=a) = CborPositiveInt( tags=[12],     value=268435455),  
    CborString(tags=[34], value=b) = CborNegativeInt( tags=[],       value=-1),  
    CborString(tags=[56], value=c) = CborByteString(  tags=[78],     value=h'cafe),  
    CborString(tags=[],   value=d) = CborString(      tags=[90, 12], value=Hello World)  
})
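
To follow the annotated hex dump by hand, it helps to split each initial byte into its major type (the high 3 bits) and its additional information (the low 5 bits), as defined in RFC 8949. The sketch below does this for the initial bytes that occur in the example. The helper names (Head, head, describe) are made up for illustration and are not part of the PR or the library.

```kotlin
// Major type and additional information of a CBOR initial byte (RFC 8949, section 3).
data class Head(val majorType: Int, val additionalInfo: Int)

// High 3 bits: major type. Low 5 bits: additional information.
fun head(initialByte: Int): Head =
    Head(majorType = (initialByte shr 5) and 0x07, additionalInfo = initialByte and 0x1f)

// Human-readable names for major types 0 through 7.
val majorTypeNames = listOf(
    "unsigned", "negative", "bytes", "text", "array", "map", "tag", "simple/float"
)

fun describe(initialByte: Int): String {
    val (major, info) = head(initialByte)
    return "${majorTypeNames[major]}, additional info $info"
}

fun main() {
    // Initial bytes taken from the dump above.
    for (b in listOf(0xbf, 0x61, 0xcc, 0x1a, 0xd8, 0x20, 0x42, 0x6b, 0xff)) {
        println("0x${b.toString(16)} -> ${describe(b)}")
    }
}
```

For example, 0xcc splits into major type 6 with additional info 12, which is tag(12), and 0x20 splits into major type 1 with additional info 0, which encodes -1 - 0 = -1. Additional info 24 and 26 (as in 0xd8 and 0x1a) mean the actual value follows in the next 1 or 4 bytes, and 31 marks an indefinite-length item (0xbf) or the break code (0xff).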

Implementation Details

I tried to stick to the existing CBOR codepaths as closely as possible, and the approach of adding tags directly to CborElements is the most pragmatic way of getting expressiveness and convenient use. It does come with a caveat (also taken from the Readme):

Tags are properties of CborElements, and it is possible to mix arbitrary serializable values with CborElements that contain tags inside a serializable structure. It is also possible to annotate any [CborElement] property of a generic serializable class with @ValueTags.
This can lead to asymmetric behavior when serializing and deserializing such structures!

The test cases (and the comments in the test cases) reflect this.

Closing Remarks

I also fixed a faulty hex input test vector that, if I pieced it together correctly, I introduced myself last year (see here), and I amended the benchmarks (see here).

Since the commits from here will be squashed anyways, I did not care for a clean history.

JesusMcCloud, Jul 05 '25 05:07

Full disclosure: This PR incorporates code from a draft generated by Junie (albeit an impressive draft that saved a day of work). This is not a dumb copypasta of AI-generated code. Even if it were already feature-complete, it would still not be marked ready for review, because we have yet to review everything internally. I also want to stress that "we" is not a euphemism. There will be at least two of us reviewing and discussing internally, almost certainly with additional input from other humans in the process of readying this PR.

JesusMcCloud, Jul 08 '25 10:07

Performance seems to be OK (fromBytes and toBytes are the baseline on my machine):

Metric / Benchmark  fromBytes          fromStruct         structFromBytes    toBytes            structToBytes      toStruct
Average (ops/ms)    1205.615 ± 20.541  1545.814 ± 50.743  2896.728 ± 74.485  2089.013 ± 30.152  1442.766 ± 32.257  2581.397 ± 32.497
Min                 1186.023           1458.225           2796.131           2066.499           1404.482           2550.026
Max                 1229.778           1581.420           2960.572           2125.658           1475.015           2619.815
Stdev               13.586             33.563             49.267             19.944             21.336             21.495
CI low (99.9 %)     1185.075           1495.071           2822.244           2058.861           1410.509           2548.900
CI high (99.9 %)    1226.156           1596.557           2971.213           2119.165           1475.023           2613.893

My hot takes:

  • Deserialising from a structure is fast enough, since it is in the same ballpark as deserialising from bytes
  • Deserialising into a generic CBOR structure takes twice as long as deserialising directly, which is fine, given that we instantiate much more: even primitives need a containing class and an array of tags
  • Serialising a generic CBOR structure to bytes is faster than, but in the same ballpark as, generic to-byte serialisation of arbitrary serializable data
  • Serialising to a CBOR structure is slower than serialising to bytes, but OK enough, since it's in the same ballpark and we instantiate more

JesusMcCloud, Aug 07 '25 11:08

I just noticed something that looks weird to me. See this test case here that is failing and closely compare expected vs actual.

The byte string is wrapped twice for the reference. ~~I know there were some discussions, but I don't recall them, so I have to ask: why? Did I mess this up last year, or is this intentional? Because the way I see it, we're wrapping a byte array instead of encoding it differently.~~
EDIT: the test vector is faulty, as this comparison fails the same way.

JesusMcCloud, Aug 07 '25 12:08

Any updates on the open discussion points?

JesusMcCloud, Oct 07 '25 11:10

Thanks for all the comments! I'll have to dig up some memories that have since collected dust to sort out some of the issues, and figure some things out again from scratch, as I haven't looked into this for many weeks and have forgotten most of the implementation details ;-). So it will take a bit before I push changes addressing the issues.

JesusMcCloud, Dec 05 '25 09:12