cbor2 icon indicating copy to clipboard operation
cbor2 copied to clipboard

C-extension does allow customizing serialization

Open vamega opened this issue 1 year ago • 2 comments

Things to check first

  • [X] I have searched the existing issues and didn't find my bug already reported there

  • [X] I have checked that my bug is still present in the latest release

cbor2 version

5.5.1

Python version

3.11

What happened?

The behaviour of the serializer differs between the C extension and the Python implementation of this library.

The Python implementation allows customizing the by updating the value of the _encoders map for known types. The C-implementation as an optimization does not use this map and defines it's own implementations of encoders for the known types. This comment on the original issue adding the C implementation shows that it was intended that the C implementation allow for the canonical parameter to be set to a value of 2 to opt into the slower codepath where the Python defined encoders were used.

The two changes below appear to be the cause of this:

  • https://github.com/agronholm/cbor2/pull/77 enforced that enc_style is only ever 0 or 1 (0 when initialized, 1 if the input is exactly 1).
  • This commit changed the parsing of the field to parse it as a predicate and not a byte.

A few questions here:

  1. I recognize that modifying _encoders is not necessarily a supported interface for the library given how the variable is named? If that's not the case, is there no supported way to perform this substitution?
  2. If allowing the C implementation to use the Python defined encoders, how should this be exposed? Go back to overloading the canonical parameter? Introduce an Enum as a value for that field?

How can we reproduce the bug?

The following test case should show the difference. The python implementation will pass, the C implementation will fail.

import calendar

def test_encoder_replacement(impl):
    def to_epoch(dt: datetime) -> float:
        millis = int(dt.microsecond / 1000.0) / 1000.0
        return calendar.timegm(dt.astimezone(timezone.utc).timetuple()) + millis


    def encode_datetime(self, obj: datetime):
        self.encode_float(to_epoch(obj))


    dt = datetime(2013, 3, 21, 20, 4, 0, tzinfo=timezone.utc)
    expected = unhexlify("fb41d452d9ec000000")
    with BytesIO() as fp:
        encoder = impl.CBOREncoder(fp, canonical=2)
        encoder._encoders[datetime] = encode_datetime
        encoder.encode(dt)
        serialized = fp.getvalue()

vamega avatar Nov 15 '23 20:11 vamega

Curiously enough, this could have been missed in version 5.4.3 if people were accessing the encoder via cbor2.encoder instead of the top level module. The C extension didn't shadow namespaces other than the top level one.

vamega avatar Nov 15 '23 20:11 vamega

None of the serialization libraries I know of allow overriding the handling of built-in types. This was never supported, as you correctly deduced from the naming of the _encoders private variable. What's your use case?

agronholm avatar Nov 16 '23 10:11 agronholm