
Typed messaging and validation

Open goodboy opened this issue 4 years ago • 14 comments

I was originally going to make a big post on pydantic and how we could offer typed messages using that very, very nice project, despite there being a couple of holdups for integration with msgpack.

However, it turns out just today an even faster, msgpack-specific project was released: msgspec 🏄🏼

It claims to not only be faster than msgpack-python but also to support schema evolution and other niceties. It also has perf bumps when making multiple repeated encode/decode calls, which is exactly how we're currently using msgpack inside our Channel.

Overall there looks to be no downside and we'll get typed message semantics fast and free 👍🏼

For reference, I'll leave a bunch of links I'd previously gathered regarding making pydantic work with msgpack:

  • https://github.com/samuelcolvin/pydantic/issues/951
  • https://pydantic-docs.helpmanual.io/usage/dataclasses/
  • https://github.com/samuelcolvin/pydantic/pull/595
  • https://github.com/tiangolo/fastapi/issues/1285
  • https://github.com/MolSSI/QCElemental/blob/master/qcelemental/models/basemodels.py#L121
    • effectively this just adds a `BaseModel.serialize()` which looks up a serializer method by name (e.g. json, msgpack), but isn't really adding any "native-feeling" support nor speed gains afaict.
TODO
  • [ ] support for a msgpack-python custom type serializer for pydantic.BaseModel such that we just implicitly render with `.dict()` at pack time and load via `Model(**message)` at decode time?
  • [ ] write ourselves a small bytes-length prefixed framing protocol for msgspec as per the comments in #212
      # NB: `receive_all_or_none()` / `receive_exactly()` are assumed helper
      # methods on the stream (per the #212 discussion), not stdlib APIs.
      while header := await stream.receive_all_or_none(4):
          length, = struct.unpack("<I", header)
          # probably want to sanity-check `length` for not being unreasonably huge
          chunk = await stream.receive_exactly(length)
          # do something with chunk
    
  • [ ] consider offering msgspec as an optional dependency if we end up liking it?
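A minimal sketch of the first TODO item above, assuming msgpack-python and pydantic are installed (the `Ping` model and `enc_hook` name are illustrative, not an existing API):

```python
import msgpack
from pydantic import BaseModel

# Hypothetical message model for illustration.
class Ping(BaseModel):
    msg: str
    count: int

def enc_hook(obj):
    # msgpack calls this for types it can't serialize natively;
    # render any BaseModel to a plain dict at pack time.
    if isinstance(obj, BaseModel):
        return obj.dict()
    raise TypeError(f"can't serialize {type(obj)}")

wire = msgpack.packb(Ping(msg="hi", count=3), default=enc_hook)
# decode side: the receiver knows the expected model type
msg = Ping(**msgpack.unpackb(wire))
assert msg == Ping(msg="hi", count=3)
```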

goodboy avatar Feb 24 '21 03:02 goodboy

That's really neat! I was looking at implementing Pydantic in a project a little while ago, and chose not to. It seemed like the API wasn't quite what I was looking for. I was wanting data classes, and confidence that serialization and deserialization were both strict. I'm not quite sure why I concluded that, unfortunately. I knew about the data classes integration with Pydantic, but there was something missing with it that I felt I needed.

msgspec looks pretty cool for when you control the data format, but that definitely wasn't part of what I was doing. (I was writing an API wrapper over a JSON API).

I know many people have gotten a lot of mileage out of Pydantic. It's a great project.

ryanhiebert avatar Feb 24 '21 14:02 ryanhiebert

Yeah, alternatively we've been thinking about using capnproto, and in particular seeing if we can auto-gen schema from type-annotated Python functions.

I think this would be a huge boon since we'd get CBS (capability based sec) for free 🏄🏼.

The only holdup will be figuring out how pycapnp can work with async stuff and if it can help us with the schema gen/loading. There now appears to be asyncio support, but I'm not sure how/if that will get in our way, or if we can work off that impl to support trio.

goodboy avatar Feb 24 '21 17:02 goodboy

Oh, also another notable project (for a tractor dependent that will likely soon be broken out into its own repo): nptyping, which may prove useful for automatic serialization of arrays.

goodboy avatar Feb 24 '21 17:02 goodboy

Linking to https://github.com/jcrist/msgspec/issues/25 since we'll likely need nested Structs to make this easiest to implement (messages containing strictly typed payloads, themselves also defined as structs). Otherwise there may need to be some finagling: either hack a standard message schema where payloads are decoded specifically as structs, or just always decode to a dict. The former would be better considering the claimed speed improvement:

Depending on the schema, deserializing a message into a Struct can be roughly twice as fast as deserializing it into a dict.

goodboy avatar Mar 07 '21 17:03 goodboy

in particular seeing if we can auto-gen schema from type annotated Python functions.

Is there an issue for this?

Essentially to do this, we need to:

  1. Parse dataclasses and save Field attributes
  2. Feed this into networkx to build a graph with child, `isa` and `hasa` relationships
  3. Use the builder pattern over the networkx graph with a dialect (capnproto or protobuf, etc.)

gc-ss avatar May 12 '21 07:05 gc-ss

@gc-ss not yet specifically; feel free to make one of course if you have some ideas and/or want to try it out.

Also, I think this could easily be wrapped in an external repo for use as well; it doesn't have to be tractor specific.

goodboy avatar May 12 '21 11:05 goodboy

Feed this into networkx to build a graph with child, `isa` and `hasa` relationships

@gc-ss wait why would you need this? Afaiu graph relations aren't relevant here; are you talking about building nested structs as trees or?

goodboy avatar May 12 '21 13:05 goodboy

Afaiu graph relations aren't relevant here; are you talking about building nested structs as trees or?

Consider this:


class A:
    a: int


class B(A):
    b: int

class C(A):
    c: int

class D:
    composes_c: C

Now if we wanted to auto-gen a schema for type D, we don't want to spit out B. Also, it's possible some schema libraries might want schemas ordered in a certain way depending on the dependency tree.

So you need graphs
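To sketch the ordering idea with stdlib tools (using `graphlib` in place of the proposed networkx, with dataclasses standing in for real field parsing; all names here are illustrative):

```python
from dataclasses import dataclass, fields, is_dataclass
from graphlib import TopologicalSorter

# Toy hierarchy from the comment above (dataclass-ified for field access).
@dataclass
class A:
    a: int

@dataclass
class B(A):
    b: int

@dataclass
class C(A):
    c: int

@dataclass
class D:
    composes_c: C

def dep_graph(root: type) -> dict:
    """Walk `isa` (base class) and `hasa` (field type) edges from root."""
    graph: dict = {}
    stack = [root]
    while stack:
        cls = stack.pop()
        if cls in graph:
            continue
        deps = {b for b in cls.__bases__ if is_dataclass(b)}  # isa edges
        deps |= {f.type for f in fields(cls) if is_dataclass(f.type)}  # hasa edges
        graph[cls] = deps
        stack.extend(deps)
    return graph

# Topological order puts dependencies first, and B is never emitted
# since nothing reachable from D references it.
order = [cls.__name__ for cls in TopologicalSorter(dep_graph(D)).static_order()]
print(order)  # ['A', 'C', 'D']
```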

What do you think?

If this makes sense, I can move these into a different repo and send you a link.

gc-ss avatar May 12 '21 14:05 gc-ss

@gc-ss yeah, as I was thinking, you mean for composed structs/types.

If this makes sense, I can move these into a different repo and send you a link.

Cool, yeah if you're interested in working on this then for sure. We can also experiment here around the tractor IPC apis and see how it forms out with tinkering, then move it to a new project.

Up to you, I don't have immediate bandwidth for this.

goodboy avatar May 12 '21 14:05 goodboy

~The first holdup with msgspec is mentioned in https://github.com/jcrist/msgspec/issues/27: they have no streaming decoder api.~

No longer a problem, we just have to write a prefix framing stream packer; see above.
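A minimal sketch of such a length-prefixed packer, stdlib only (`io.BytesIO` stands in for the real stream; the 16 MiB cap is an arbitrary assumption):

```python
import io
import struct

MAX_FRAME = 2**24  # sanity cap on frame size (assumption, tune as needed)

def send_framed(stream, payload: bytes) -> None:
    # 4-byte little-endian length prefix, then the payload itself.
    stream.write(struct.pack("<I", len(payload)) + payload)

def recv_framed(stream):
    header = stream.read(4)
    if len(header) < 4:
        return None  # clean EOF (or a truncated header)
    length, = struct.unpack("<I", header)
    if length > MAX_FRAME:
        raise ValueError(f"frame too large: {length}")
    return stream.read(length)

buf = io.BytesIO()
for chunk in (b"one", b"two", b"three"):
    send_framed(buf, chunk)
buf.seek(0)
frames = []
while (frame := recv_framed(buf)) is not None:
    frames.append(frame)
print(frames)  # [b'one', b'two', b'three']
```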

goodboy avatar May 31 '21 12:05 goodboy

Hmm, alternatively, to get typing going sooner than later we could just make some pydantic message type handlers. Pretty sure all we'd need is detection of a BaseModel, then serialization with `.dict()` on encode and decoding into a `BaseModel(**dict)`.

Pretty sure we could offer this as an extras dependency as well?

goodboy avatar Jun 06 '21 13:06 goodboy

Linking explanation from https://github.com/jcrist/msgspec/issues/25

goodboy avatar Feb 09 '22 16:02 goodboy

Probably worth noting is dataclass union libs like https://github.com/yukinarit/pyserde

goodboy avatar May 26 '23 19:05 goodboy

Hilarious to see a writeup of what we've been doing in this repo for years 😂 https://kobzol.github.io/rust/python/2023/05/20/writing-python-like-its-rust.html#fnref:2

the part on ADTs is particularly notable as part of this feature work 🏄🏼

goodboy avatar May 26 '23 19:05 goodboy