msgspec Support encoding `int` and `str` subclasses

Description

Hey!

Was trying out the project in an attempt to experiment with it and hopefully be able to utilize it in one of my projects, as it looks really promising. One of the issues I ran into was when it came to encoding an int or str subclass.

import msgspec


class Foo(int):
    ...


class Bar(str):
    ...


try:
    msgspec.json.encode(Foo(1))
except Exception as ex:
    print(ex)

try:
    msgspec.json.encode(Bar("testing"))
except Exception as ex:
    print(ex)

Encoding objects of type Foo is unsupported
Encoding objects of type Bar is unsupported

I found that the relevant code for this is https://github.com/jcrist/msgspec/blob/8233e5b55d6e04b921a09f1eed6e9f5d44dce5da/msgspec/_core.c#L10276, but I don't know CPython's C API well enough to tell whether there is an efficient way to do this. I believe the fix should be as easy as checking the bases instead of only the type. I don't mind digging around to try and fix this (if deemed appropriate) with some pointers in the right direction :)

Thanks again for the effort put in the lib! Looks really really promising!!

Feb 02 '23 12:02 davfsa

One way I can think of to fix this would be to check for the __int__ and __str__ magic methods respectively and call them if defined.

Feb 02 '23 12:02 davfsa

Thanks for opening this. These types are intentionally unsupported right now - see my comments in #248 for the reasoning behind this decision. Once we redo our extension mechanism to allow for selectively overriding how types are encoded/decoded it'll be possible for us to natively handle scalar-type (e.g. int, str) subclasses treating them as their base classes if no extension is registered.

Until then you'll need to manually handle them using an enc_hook yourself. If all you want is to treat them the same as their base classes, the following should work for you:

import msgspec


def enc_hook(obj):
    if isinstance(obj, int):
        return int(obj)
    elif isinstance(obj, str):
        return str(obj)
    raise TypeError(f"Type {type(obj)!r} is unsupported")


class MyInt(int):
    pass


encoder = msgspec.json.Encoder(enc_hook=enc_hook)

encoder.encode([MyInt(1), MyInt(2)])
# b'[1,2]'

Does that satisfy your needs for now? Also, I'm curious - what are your use cases for int/str subclasses?

Feb 02 '23 12:02 jcrist

Ah, ok, so it does seem to be intentional. Thanks for the quick reply!

I did look at using an enc_hook, but it felt "wrong", mostly because of how much of a hot function that would be, considering the heart of the project is deserializing and serializing data.

And, to answer your question, the reason for this request is mostly because of a custom (and faster) implementation of enum we have for internal use in the library. It should be quite easy to just cast them to the proper type before sending them to msgspec for serialization. And now that I think about it, we also subclass int to provide some extra methods on top related to the specific value that it holds (mostly bit-shifting)

Feb 02 '23 15:02 davfsa

Just had a look through the issue you linked. I understand the use case you defend here:

In contrast, when someone subclasses a scalar like int/str, they sometimes do so to change a serialization behavior. For example, I know of one user who is subclassing int to support encoding integers as hex strings. If we supported all subclasses natively, there'd be no way to change the encode/decode behavior for scalar subclasses in msgspec.

which is quite similar to our second use-case I described, but the way we approached the issue is to keep casting to the class as the "ideal approach" (data is exactly as we want to store it) and then use classmethods to provide an interface for user defined data.
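To make the quoted hex-string use case concrete, here is a rough sketch of what such a hook might look like. The `HexInt` name and the "0x"-prefixed output are assumptions for illustration only, not something from the linked issue, and the hook is only consulted on msgspec versions where int subclasses are unsupported natively, as described in this thread.

```python
class HexInt(int):
    """Hypothetical int subclass that should serialize as a hex string."""


def enc_hook(obj):
    # Only the special subclass gets custom treatment; everything else
    # is reported as unsupported, as in the hook shown earlier.
    if isinstance(obj, HexInt):
        return hex(obj)
    raise TypeError(f"Type {type(obj)!r} is unsupported")


# The hook would then be passed as msgspec.json.Encoder(enc_hook=enc_hook).
```

If every int subclass were encoded as a plain int automatically, a hook like this would never get the chance to run for `HexInt`.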

Feb 02 '23 15:02 davfsa

Thanks for the info!

how much of a hot function that would be, considering the heart of the project is deserializing and serializing data.

This really depends on how much of your message is composed of custom types. In my usage messages are 99% builtin supported types, with the rare custom type. In this case the call overhead is negligible. If things skew the other way, then yeah that'll be slower. Casting before serializing is one option (if that's easy) and should be faster than relying on dispatch if done correctly. In the long run though we should be able to support your use case efficiently without losing the flexibility of custom serializers. We're just not there yet.
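As a rough illustration of the "cast before serializing" option, something like the following could be run over a message before handing it to the encoder. The `to_builtin` helper and the `Flags`/`Name` classes are made up for this sketch, and it assumes the subclasses don't override `__int__` or `__str__`.

```python
def to_builtin(obj):
    """Recursively replace int/str subclass instances with plain int/str."""
    # bool is an int subclass but is already a supported builtin; keep it.
    if isinstance(obj, bool):
        return obj
    if isinstance(obj, int):
        return int(obj)
    if isinstance(obj, str):
        return str(obj)
    if isinstance(obj, dict):
        return {to_builtin(k): to_builtin(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_builtin(v) for v in obj]
    # Anything else is passed through untouched.
    return obj


class Flags(int):
    pass


class Name(str):
    pass


message = to_builtin({"flags": Flags(3), "names": [Name("a"), Name("b")]})
# message now holds only plain int/str values and can go to msgspec.json.encode
```

Note this walks the whole message in Python, so whether it actually beats the enc_hook route depends on how much of the message is made of custom types.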

I'm going to leave this open for now until I write up an issue documenting the plan to redo extensions.

Feb 03 '23 14:02 jcrist

If needed, would be happy to help out :)

Feb 03 '23 15:02 davfsa

Thanks! I would love to expand the contributor bandwidth on this project. First step I suppose is adding developer docs (#293). :)

Feb 03 '23 15:02 jcrist

Will have a look into how to set it up and get started :)

My C knowledge is basically non-existent, but will try my best

Feb 03 '23 16:02 davfsa
