
Allow encoding and decoding using array_like True and False

Open RynoM opened this issue 2 years ago • 7 comments

I'd like to use the package to minimize a payload. Data is received from an endpoint as JSON, and I would like to encode it as array-like MessagePack. However, this currently seems hard or impossible to do. Say there is some data coming in:

import json
import msgspec

class User(msgspec.Struct, array_like=False):
    name: str

d = {"name": "John"}
u = msgspec.json.decode(json.dumps(d).encode("utf-8"), type=User)

I want to encode it for storage/IO in the array-like structure, however I cannot use the already defined Structs for this. e.g. something like:

msgspec.msgpack.encode(u, array_like=True)

After doing operations on it, storing it, or whatever, I'd like to be able to reverse this process as well, i.e. turn the array-like MessagePack-encoded payload back into a human-readable JSON payload.

I'm also having a hard time finding a DIY workaround to make this work. Any tips would be appreciated!

RynoM avatar Jul 06 '22 15:07 RynoM

Thanks for opening this! This seems like a useful case to handle.

Aside: I'm curious to learn more about your use case here (if you're willing to share). Is this an API server-like-thing? What kind of storage are you planning on storing the msgpack'd payloads in? What benefits are you hoping to get out of using msgspec this way?

I'm also having a hard time finding a DIY workaround to make this work. Any tips would be appreciated!

There unfortunately isn't really a good way to hack around this. The best thing I can recommend for now is to define a second struct subclass that defines array_like=True, manually convert the payload to use the new types, then encode:

>>> import msgspec

>>> class User(msgspec.Struct, array_like=False):
...     name: str

>>> class UserArrayLike(User, array_like=True):
...     # no need to duplicate the field definition, this is inherited
...     pass

>>> u = msgspec.json.decode(b'{"name": "john"}', type=User)
>>> u2 = UserArrayLike(u.name)  # manually convert
>>> msgspec.msgpack.encode(u2)
b'\x91\xa4john'

I can see a few different ways to make this work. The main change here is mostly at the config level; making this work in the backends should be fairly straightforward. Note that due to msgspec's design the configuration needs to be applied to the struct class, not to the encoder/decoder.

Option 1: Make array_like accept a dict

This would make array_like (and probably later omit_defaults) accept a dict mapping protocol to value. In your case you'd have:

class User(msgspec.Struct, array_like={"msgpack": True}):
    name: str

The current case of array_like=True would be shorthand for array_like={"json": True, "msgpack": True}.
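The shorthand rule for option 1 can be sketched in plain Python. This is only an illustration of the proposal, not msgspec code: `normalize_array_like` and `PROTOCOLS` are hypothetical names, and treating a protocol missing from the dict as `False` is an assumption the proposal does not spell out.

```python
from typing import Dict, Mapping, Union

# Hypothetical: the protocols the proposal mentions.
PROTOCOLS = ("json", "msgpack")


def normalize_array_like(value: Union[bool, Mapping[str, bool]]) -> Dict[str, bool]:
    """Expand the proposed array_like setting into one flag per protocol."""
    if isinstance(value, bool):
        # array_like=True is shorthand for {"json": True, "msgpack": True}.
        return {protocol: value for protocol in PROTOCOLS}
    unknown = set(value) - set(PROTOCOLS)
    if unknown:
        raise ValueError(f"unknown protocols: {sorted(unknown)}")
    # Assumption: protocols not named in the dict default to False.
    return {protocol: bool(value.get(protocol, False)) for protocol in PROTOCOLS}


print(normalize_array_like(True))
print(normalize_array_like({"msgpack": True}))
```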

Option 2: Add protocol-specific config options

This would add new json/msgpack kwargs (or maybe json_options/msgpack_options? idk what a good name would be) that each take a dict mapping config options to apply to their relevant protocols. The config inheritance would then be:

  • value in protocol-specific dict, if present (e.g. json_options={"array_like": True})
  • value in top-level config, if present (e.g. array_like=True)
  • value in base class, if present

So your example would be:

class User(msgspec.Struct, msgpack_options={"array_like": True}):
    name: str
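The inheritance order listed for option 2 can be sketched as a simple lookup. This is a plain-Python illustration of the proposal, not how msgspec is implemented: `resolve_option` and its parameters are hypothetical, and falling back to `False` when nothing sets the option is an assumption.

```python
from typing import Any, Mapping


def resolve_option(
    name: str,
    protocol: str,
    protocol_options: Mapping[str, Mapping[str, Any]],
    top_level: Mapping[str, Any],
    base_config: Mapping[str, Any],
    default: Any = False,
) -> Any:
    """Return the first value found, checking the most specific scope first."""
    per_protocol = protocol_options.get(protocol, {})
    # Order: protocol-specific dict, then top-level config, then base class.
    for scope in (per_protocol, top_level, base_config):
        if name in scope:
            return scope[name]
    return default  # assumption: unset options fall back to False


# Mirrors: class User(msgspec.Struct, msgpack_options={"array_like": True})
protocol_options = {"msgpack": {"array_like": True}}

msgpack_value = resolve_option("array_like", "msgpack", protocol_options, {}, {})
json_value = resolve_option("array_like", "json", protocol_options, {}, {})
print(msgpack_value, json_value)  # True False
```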

I'm torn between these. Right now I'm leaning towards option 2, if only because it makes it easier to add additional protocol-specific options later.

If you have any thoughts on the APIs presented here, I'd love to hear them.

jcrist avatar Jul 07 '22 00:07 jcrist

I'm receiving a stream of json data that I want to store in a nosql database. I want to reduce the size of (a large part of) the payload to reduce network usage and storage size while reading/writing. Much data is infrequently read and does not need to be (human) readable/indexable while at rest. Still in the process of comparing this vs compression, but I like the added validation and having the schemas defined, with very good performance. Might also try doing both.

Manually converting is annoying because we have nested structs, with optional values, within lists, within dicts, etc. So would need to do some custom hacky parsing it feels like. Any suggestion for this?

In terms of support, it feels most natural to me to have the ability to do this during encoding/decoding, as I had above.

msgspec.msgpack.encode(u, array_like=True, omit_defaults=True)

But I guess this is a much more fundamental/unwanted change. As for options 1/2, maybe option 1 makes more sense because you don't have to deal with:

class User(msgspec.Struct, array_like=True, msgpack_options={"array_like": False}):
    name: str

But really either seems good :).

RynoM avatar Jul 07 '22 09:07 RynoM

So would need to do some custom hacky parsing it feels like. Any suggestion for this?

Yeah, this seems unpleasant. The good news is the feature I outlined above should be pretty quick to implement. If you're fine waiting a week or two for the next release then there should be no need to hack around this.

jcrist avatar Jul 07 '22 21:07 jcrist

I made a workaround for now to be able to test the scenario, but would be great to have this implemented!

RynoM avatar Jul 11 '22 09:07 RynoM

I made a workaround for now to be able to test the scenario

Glad to hear it! I'm curious - how'd the test turn out? Do you still think this feature would be useful for you?

This is proving a little more complicated to implement than I would have hoped (mostly due to error handling). I have a plan laid out, but it'll require some internal refactoring. I still think this feature makes sense, but don't want to spend the effort (yet) if you don't think you'll still want it.

jcrist avatar Jul 15 '22 14:07 jcrist

I still think it would be useful, but so far we haven't prioritized implementing this on our side, so if 'pretty quick to implement' didn't turn out to be true, it doesn't need to be a priority here either.

RynoM avatar Jul 26 '22 13:07 RynoM

Howdy! I'm just popping in to an old issue to ask a general question.

Note that due to msgspec's design the configuration needs to be applied to the struct class, not to the encoder/decoder.

This fact does seem reasonably clear from the API docs - my question is, is this design relatively difficult to change from an implementation perspective?

I ask because I believe this to be a general usability issue - as a matter of purity (as opposed to, potentially, implementation practicality), a schema should not be "complected" with considerations about behavior at the time of construction or deconstruction.

I've run into plenty of cases where I needed to use the exact same schema (with many levels of recursive schemas, resulting in several hundred fields overall) in different circumstances with different behavior, specifically with these types of facilities (e.g. omit_defaults). Redefining the whole schema, with each class recursively, is simply not an option - we'd have to write our own deconstruction code, or not use msgspec in the first place.

This is really my only quibble with msgspec as a whole, which is a very impressive library. I'm just wondering if this is a limitation that could potentially be lifted over time, or if it's baked in to the way msgspec operates at the lowest levels.

petergaultney avatar Feb 05 '24 15:02 petergaultney