msgpack-rust

Specification of behavior with changing data types (versioning / schema evolution)

Open bluenote10 opened this issue 2 years ago • 2 comments

I'm looking for a highly compact serialization format that allows for a certain backwards compatibility. rmp-serde looks like a very interesting candidate, because its non-named representation is very concise indeed. What I couldn't answer from studying the docs is the aspect of backwards compatibility, i.e., how rmp-serde handles changes to data types -- in particular when using the compact non-named serialization.

Imagine serializing a struct with fields a and b and storing it on disk. Later an optional field c (with a default) is added. Will that work, or will the data become unreadable? What about removing or renaming fields? Does the order of fields matter, i.e., will the data become unreadable when swapping the field order to b and a? Does that depend on whether a and b have the same or different types?

It would be great if the documentation could specify what kind of assumptions users can make regarding changing data types.

bluenote10, Apr 09 '22 09:04

It's undocumented, because it hasn't been carefully considered and tested. I don't mind committing to keeping specific data representations and compatibility.

If you'd like to rely on some things, please contribute unit tests that ensure they keep working.

AFAIK currently:

  • order does matter when using non-named serialization.
  • names of struct fields don't matter when using non-named serialization.
  • adding default fields at the end of structs should be fine.
  • removal of fields can break non-named serialization. May be fine for named.
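As a rough illustration of why positional (non-named) encoding behaves as listed above, here is a std-only toy model. It is not rmp-serde's implementation or API: `Value`, `encode_v1`, `decode_v2`, `V2`, and the default of 7 for `c` are all made up for this sketch. It only assumes that a non-named format stores field values in declaration order, without names.

```rust
// Toy model only: this is NOT rmp-serde's implementation or API.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Text(String),
}

// "Old" schema: struct { a: i64, b: String }, written positionally.
fn encode_v1(a: i64, b: &str) -> Vec<Value> {
    vec![Value::Int(a), Value::Text(b.to_string())]
}

// "New" schema: a trailing field `c` with a (hypothetical) default of 7.
#[derive(Debug, PartialEq)]
struct V2 {
    a: i64,
    b: String,
    c: i64,
}

fn decode_v2(values: &[Value]) -> Result<V2, String> {
    let a = match values.get(0) {
        Some(Value::Int(n)) => *n,
        other => return Err(format!("field 0: expected int, got {:?}", other)),
    };
    let b = match values.get(1) {
        Some(Value::Text(s)) => s.clone(),
        other => return Err(format!("field 1: expected text, got {:?}", other)),
    };
    // A missing trailing field falls back to its default.
    let c = match values.get(2) {
        Some(Value::Int(n)) => *n,
        None => 7,
        other => return Err(format!("field 2: expected int, got {:?}", other)),
    };
    Ok(V2 { a, b, c })
}

fn main() {
    let old = encode_v1(1, "x");
    // Appending a defaulted field: old data still decodes.
    println!("{:?}", decode_v2(&old));
    // Swapping field order: the types no longer line up, so decoding fails.
    let mut swapped = old.clone();
    swapped.swap(0, 1);
    println!("{:?}", decode_v2(&swapped));
    // Removing the first field shifts every later position.
    println!("{:?}", decode_v2(&old[1..]));
}
```

In this model, swapping two fields of the same type would not fail at all; the values would silently trade places. That case is probably the most important one to cover with a unit test.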

kornelski, Apr 11 '22 16:04

Thanks, this kind of information already helps a lot!

Background: I was basically skimming over possibilities for highly compact (which implies schema-based / non-self-describing) serialization, combined with some way of dealing with schema evolution. After understanding serde better and playing around with various binary non-self-describing formats, I've come to the conclusion that serde is not quite the right tool for the job (for reference on versioning, see https://github.com/serde-rs/serde/issues/1137). All non-self-describing serializers I tried suffer from order dependence, breaking field removals, and a lack of support for certain "named only" serde features like skip_serializing_if. This kind of has to be the case, since serde lacks protobuf-like field offset annotations or cereal/boost-serialization-like "class versions". There is probably no way to fix it on the rmp_serde side without storage overhead, so documenting the behavior should be sufficient.
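To make the storage-overhead point concrete, here is a std-only toy sketch of protobuf-style field tags. It is purely hypothetical and not a serde or rmp-serde feature: `encode`, `decode`, the one-byte tag, and the fixed 8-byte integer layout are assumptions chosen for brevity.

```rust
// Toy model of protobuf-style field tags; hypothetical, not a serde or rmp-serde feature.
use std::collections::HashMap;

// Each field is stored as (tag, value); the one-byte tag is the storage overhead.
fn encode(fields: &[(u8, i64)]) -> Vec<u8> {
    let mut out = Vec::new();
    for (tag, value) in fields {
        out.push(*tag);
        out.extend_from_slice(&value.to_le_bytes());
    }
    out
}

// Reads back 9-byte records: 1 tag byte followed by an 8-byte little-endian integer.
fn decode(bytes: &[u8]) -> HashMap<u8, i64> {
    let mut fields = HashMap::new();
    for chunk in bytes.chunks_exact(9) {
        let mut raw = [0u8; 8];
        raw.copy_from_slice(&chunk[1..9]);
        fields.insert(chunk[0], i64::from_le_bytes(raw));
    }
    fields
}

fn main() {
    // The writer knows fields 1 (a) and 2 (b), and happens to write b first.
    let bytes = encode(&[(2, 20), (1, 10)]);
    let fields = decode(&bytes);
    // The reader has since dropped field 2 and added field 3 with default 0.
    let a = fields.get(&1).copied().unwrap_or(0);
    let c = fields.get(&3).copied().unwrap_or(0);
    println!("a = {a}, c = {c}, encoded size = {} bytes", bytes.len());
}
```

Lookup by tag makes field order, removal, and addition harmless for the reader, but every field pays for its tag: two integers take 18 bytes here instead of the 16 a purely positional layout would need.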

bluenote10, Apr 15 '22 17:04