Kotlin serialization compiler plugin

Open elizarov opened this issue 5 years ago • 29 comments

This is an issue to discuss the proposal for the Kotlin serialization compiler plugin.

elizarov avatar Sep 07 '18 14:09 elizarov

  • Reflection API is not available in Kotlin/JS and Kotlin/Native, so it is impossible to use a runtime-based solution.
  • Kapt is not supported by Kotlin/JS and Kotlin/Native, so it is impossible to use annotation processing.
  • Standalone code generators do not work well with incremental compilation and it is non-trivial to include them as a build phase.

This is more of a meta note and I don't want to derail comments on this issue (and as a relatively satisfied consumer of multiplatform serialization), but this problem needs a better general solution. Not every dynamic programming requirement should have to be solved by a first-party compiler plugin and IDE plugin. This use case is a perfect example of something that shouldn't have to be baked into Kotlin and it's a failure of the language tooling that it currently must be.

JakeWharton avatar Sep 07 '18 14:09 JakeWharton

@JakeWharton I absolutely agree! Moreover, let me add here that we have a general idea of how this can be done. The full description of this imagined mechanism is out of the scope of this concrete serialization plugin proposal, but the name of the idea already tells a lot -- we call it "compile-time reflection". The basic idea is that you should be able to perform reflection on the static structure of your code at compile time from regular Kotlin code. This is somewhat similar to the proposed meta-programming facilities for C++, but I believe that we can do it in a much more type-safe and far more toolable fashion by leveraging the concept of Kotlin inline functions. You can think of it as "inline functions on steroids". The key observation here is that you can already do serialization via run-time reflection on Kotlin/JVM. Doing it at compile time is just a performance optimization. A sufficiently advanced compiler can figure out that the static structure of the code (like a list of class members) does not change at run time and perform advanced "constant propagation" and "loop unrolling" at compile time to turn code that is written using run-time reflection into code that is fully templated for each specific class. (P.S. But first we need to finish house-keeping and switch all Kotlin compiler backends to an IR-based implementation to enable this kind of stuff.)

elizarov avatar Sep 07 '18 15:09 elizarov

As my friend points out, reflection is defined as "The ability of a computer program to examine, introspect and modify its own structure and behavior at runtime". Consider calling it "compile-time type introspection" to avoid clashing with this definition.

Also, this sounds a lot like some kind of macro functionality, like Scala has? I don't know how they work, but just asking if it's similar. It sounds interesting.

Dico200 avatar Sep 08 '18 00:09 Dico200

@Dico200 We don't plan any macros in Kotlin. The idea is that you should write code using the runtime reflection API (and you can debug it as such), but since the structure of the program is statically known at compile time, the corresponding values (property names, etc.) can be inlined during compilation to the point where there are no reflective calls left at run time.

elizarov avatar Sep 08 '18 06:09 elizarov

@Dico200 If I understand @elizarov correctly, here is my interpretation of what's going on.

There are already two different levels of code generation in the compiler:

  1. When it sees a statement, it compiles it to bytecode
  2. When it sees a function call and that function is inline, instead of generating bytecode that calls that function, it inlines its body.

What @elizarov is proposing is to extend this compilation intelligence even further: make the compiler understand reflection code and replace it with equivalent, but static, code.
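
To make this interpretation concrete, here is a minimal Kotlin sketch. It is not how the Kotlin compiler works today, and `User`, `describeDynamic`, and `describeUserStatic` are hypothetical names invented for illustration. The first function is written in a reflection-like style; the second is written by hand and only stands in for the kind of specialized code a compiler could, in principle, derive from the first once the property list is known statically.

```kotlin
import kotlin.reflect.KProperty1

// Hypothetical example class; not taken from the proposal.
data class User(val name: String, val age: Int)

// Reflection-style code: walks a list of properties at run time.
fun <T> describeDynamic(value: T, props: List<KProperty1<T, *>>): String =
    props.joinToString(", ") { prop -> "${prop.name}=${prop.get(value)}" }

// Hand-written stand-in for the "static" equivalent: the loop is unrolled
// and the property names have become constants.
fun describeUserStatic(user: User): String =
    "name=${user.name}, age=${user.age}"

fun main() {
    val user = User("Ann", 30)
    val props = listOf<KProperty1<User, *>>(User::name, User::age)
    println(describeDynamic(user, props))
    println(describeUserStatic(user))
}
```

Both calls print `name=Ann, age=30`; the difference is only in how much work is left for run time.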

cbeust avatar Sep 09 '18 13:09 cbeust

This plugin is similar to parcelize plugin

That is true. But I'm really curious how real code should look when you need both serialization (for example, for request/response parsing) and Parcelable (to save a screen's instance state or pass this data between Android components). It should be a very common case.

Annotate with @Parcelize and @Serializable at the same time? That should probably work even now. One more option is to write a Parcelable implementation for kotlinx.serialization, but such code can only be used to write to a Parcel manually and cannot be used to generate a Parcelable implementation of a class.

Even if two annotations work, it would be good to have some integrated solution that at least does not generate very similar bytecode twice, or even allows generating a Parcelable implementation for classes with @Serializable, the same as @Parcelize does (maybe with a compiler plugin flag).

gildor avatar Sep 10 '18 09:09 gildor

@gildor The serialization plugin generates code which is abstract over the storage; one may write an implementation of CompositeEncoder which writes primitives to a Parcel and therefore make all @Serializable classes effectively Parcelable (they will not actually implement this interface, but they can be saved to and read from a Parcel via the kotlinx.serialization API). Is implementing the Parcelable interface by the concrete class your main concern? In this case, I see a way to solve this: tune the @Parcelize plugin in such a way that the generated methods delegate to the kotlinx.serialization API. But this still requires both annotations on the class, or another composite mechanism. However, these methods will be very small and can easily be written by hand in case you really need them.
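
As a rough illustration of "abstract over the storage", the sketch below uses a made-up `SimpleEncoder` interface rather than the real CompositeEncoder, and a plain list rather than an Android Parcel, so that it runs without any dependencies. `Message.write` is written by hand here and only plays the role of the code the plugin would generate.

```kotlin
// Hypothetical, heavily simplified stand-in for an encoder interface.
interface SimpleEncoder {
    fun encodeString(value: String)
    fun encodeInt(value: Int)
}

// Hypothetical class; write() stands in for plugin-generated code.
// It knows the class structure but nothing about the storage.
data class Message(val author: String, val likes: Int) {
    fun write(encoder: SimpleEncoder) {
        encoder.encodeString(author)
        encoder.encodeInt(likes)
    }
}

// Stand-in for a Parcel-backed encoder: appends primitives in order.
class ListEncoder : SimpleEncoder {
    val out = mutableListOf<Any>()
    override fun encodeString(value: String) { out += value }
    override fun encodeInt(value: Int) { out += value }
}

// A second storage: a semicolon-delimited line of text.
class TextEncoder : SimpleEncoder {
    private val parts = mutableListOf<String>()
    override fun encodeString(value: String) { parts += "\"$value\"" }
    override fun encodeInt(value: Int) { parts += value.toString() }
    fun result(): String = parts.joinToString(";")
}

fun main() {
    val msg = Message("gildor", 3)
    val list = ListEncoder().also { msg.write(it) }
    val text = TextEncoder().also { msg.write(it) }
    println(list.out)
    println(text.result())
}
```

The same `write` method feeds both encoders unchanged, which is the property that would let a Parcel-backed encoder be added without touching the serializable classes.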

sandwwraith avatar Sep 10 '18 09:09 sandwwraith

That's a nice and simplified explanation. Thank you cbeust, elizarov. Very interesting idea.

Dico200 avatar Sep 10 '18 18:09 Dico200

@sandwwraith Yes, that's exactly what I meant: it's pretty straightforward to write an encoder for Parcel.

Is implementing Parcelable interface by the concrete class your main concern

Yes, this can be a problem sometimes; for example, you cannot use such a class without a Parcelable implementation as a property of another Parcelable class.

But this still requires both annotations on the class, or another composite mechanism.

I don't think that it's a big problem in general; it can probably also be solved with a meta-annotation, so that one annotation covers both use cases.

I just would like to have some integrated solution and the same generated bytecode.

gildor avatar Sep 11 '18 08:09 gildor

Implementing Serializable is enough. You can write a Serializable to a Parcel, and you can also put one in a Bundle.

If the private methods are generated for the Serializable implementation (instead of letting it be handled the full reflection way), it can in fact be more efficient than a full Parcel, as you can see at the end of this article.

LouisCAD avatar Sep 11 '18 08:09 LouisCAD

@LouisCAD I think this topic is not the right place for a holy war between Parcel and Serializable or for discussing the problems of the Java Serializable interface. Reality: Parcelable is the recommended way on Android, and Kotlin already provides an official compiler plugin to work with Parcelable. This plugin works in a very similar way to kotlinx.serialization, and I think this should be discussed and kept in mind during the proposal's implementation.

You can write a Serializable to a Parcel

Yes, but you cannot do that for classes annotated with @Serializable, so it's not related to the topic.

gildor avatar Sep 11 '18 09:09 gildor

@gildor

you cannot use such class without Parcelable implementation as a property of another parcelable class

Is this necessary for interop with some existing Android APIs? Because that should not be a problem in most use cases – just use @Serializable everywhere, and these properties can be embedded.

@LouisCAD, @kotlinx.serialization.Serializable does not add the java.io.Serializable interface to the class.

sandwwraith avatar Sep 11 '18 09:09 sandwwraith

Scanned the document and couldn't find this (but maybe I missed how to do it another way):

List all available KClass with @Serializer.

Use-case:

Having a real-time communication client with arbitrary messages. For example: a game that uses websockets for bidirectional communication.

This communication channel has arbitrary messages. To differentiate messages, there is a wrapper Packet(type: String, payload: String) class that stores a type with the fqname of the class, and then a payload, that is a serialized message of the type defined by type. Listing all the available KClass with Serializer, one could create a Map, mapping the fqname of the class to a class/serializer, and then be able to deserialize the actual message specified by the type.

Right now the workaround is to keep a list of KClass manually, with all the available classes. But that's far from ideal, since the idea of the serialization is to automate things, make it DRY and avoid boilerplate or additional generation steps.
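
For readers following along, the manual workaround described above might look roughly like the sketch below. The message types `Move` and `Chat`, the short type names, and the toy "dx,dy" payload format are all assumptions made for illustration; a real client would use fully qualified names and real serializers.

```kotlin
// Wrapper from the comment above: `type` names the message class,
// `payload` holds the serialized message.
data class Packet(val type: String, val payload: String)

// Hypothetical message types for a game protocol.
data class Move(val dx: Int, val dy: Int)
data class Chat(val text: String)

// The hand-maintained registry: type name -> decoder for the payload.
// Every new message type has to be added here by hand.
val decoders: Map<String, (String) -> Any> = mapOf(
    "Move" to { p: String -> p.split(",").let { Move(it[0].toInt(), it[1].toInt()) } },
    "Chat" to { p: String -> Chat(p) }
)

fun decode(packet: Packet): Any {
    val decoder = decoders[packet.type]
        ?: error("No decoder registered for type '${packet.type}'")
    return decoder(packet.payload)
}

fun main() {
    println(decode(Packet("Move", "1,-2")))
    println(decode(Packet("Chat", "hello")))
}
```

The `decoders` map is exactly the piece that has to be kept in sync manually, which is what the request is about.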

soywiz avatar Sep 11 '18 10:09 soywiz

@sandwwraith

Is this necessary for interop with some existing Android APIs?

Nope, I don't know of such APIs. You're right, you can use @Serializable everywhere, but that's not so good if you want to migrate an existing project to kotlinx.serialization (our case).

Another problem is that you need some method that converts an object to a Parcel/Bundle in order to put it into a Bundle, and the same to read it back.

If you decide that those points are not critical and that @Serializable plus a Parcel-writing encoder is a good solution, what is the future of the Parcelize compiler plugin in this case? To me, a single universal serialization solution looks better than two official compiler plugins that do similar things, both of which require support and a decision about which one to use.

The solution where Serializable is used but Parcelize is kept is the best IMO, because it covers all existing use cases of Parcelize, is backward compatible, and allows using different encoders/decoders as a bonus.

gildor avatar Sep 11 '18 10:09 gildor

@soywiz I think that the approach with a 'global map' is quite ad hoc, and for different message types you probably want to use polymorphic serialization, discussed in the 'appendix' section. It mentions a SerialContext where all polymorphic serializers should be registered. Bulk registration is possible via the concept of a SerialModule (not mentioned in the KEEP, since it is purely a runtime library entity), which should register all serializers that belong to some scope (e.g. file, package, or library). Currently, a SerialModule must be written by hand by implementing the function registerIn(context: MutableSerialContext). In the foreseeable future, we can provide a design to automatically generate such a module from a given file, package, or Kotlin module; I believe that this can solve your problem.
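
The shape of such a hand-written module can be sketched as follows. `MiniSerializer`, `MiniContext`, and `MiniModule` are deliberately simplified, hypothetical stand-ins and not the library's SerialContext or SerialModule types; only the `registerIn` idea is taken from the comment above.

```kotlin
import kotlin.reflect.KClass

// Simplified stand-in for a serializer: only turns a value into text.
fun interface MiniSerializer<T : Any> {
    fun serialize(value: T): String
}

// Simplified stand-in for a mutable serial context.
class MiniContext {
    private val serializers = mutableMapOf<KClass<*>, MiniSerializer<*>>()

    fun <T : Any> register(kClass: KClass<T>, serializer: MiniSerializer<T>) {
        serializers[kClass] = serializer
    }

    @Suppress("UNCHECKED_CAST")
    fun <T : Any> serialize(value: T): String {
        // Look the serializer up by the value's run-time class.
        val serializer = serializers[value::class] as? MiniSerializer<T>
            ?: error("No serializer registered for ${value::class}")
        return serializer.serialize(value)
    }
}

// Stand-in for a module: registers everything belonging to one scope.
interface MiniModule {
    fun registerIn(context: MiniContext)
}

// Hypothetical message types.
data class Ping(val id: Int)
data class Pong(val id: Int)

object GameMessages : MiniModule {
    override fun registerIn(context: MiniContext) {
        context.register(Ping::class, MiniSerializer<Ping> { "ping:${it.id}" })
        context.register(Pong::class, MiniSerializer<Pong> { "pong:${it.id}" })
    }
}

fun main() {
    val context = MiniContext()
    GameMessages.registerIn(context)
    println(context.serialize(Ping(1)))
    println(context.serialize(Pong(2)))
}
```

Generating the body of `registerIn` automatically for a file or package is the part that would remove the manual bookkeeping.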

sandwwraith avatar Sep 17 '18 14:09 sandwwraith

I've read the document and have to say that it is overall quite comprehensive and seems to have taken into account many considerations. I found one thing that I feel is an error. In the SerialDescriptor interface it has an isNullable property. As nullability is a use variation of the type (not a type declaration variation) I believe that this is incorrect. It probably should be fun isElementNullable(index:Int): Boolean like isElementOptional.

While the current interface is implementable and can provide the same information, it is more cumbersome to access and requires the instantiation of "nullable descriptor" variants for all classes. Ensuring that you only create single copies of these nullable variants is cumbersome.

pdvrieze avatar Sep 20 '18 19:09 pdvrieze

@pdvrieze The idea behind this is that while optionality is an attribute of a property solely inside the structure, nullability is an attribute of an unbound type (use-site, indeed). A particular serializer works with a type; therefore, support for nullability is an attribute of the serializer expressed in its descriptor. By default, serializers work with non-nullable types; a special NullableSerializer adapts them for use with nullable types when use-site nullability is encountered by the code generator. Implementing isElementNullable(Int) is straightforward when you have getElementDescriptor(Int) and isNullable (and it could be provided as an extension function directly in the library); adaptation in the opposite direction would require an instance check for a NullableSerializer or NullableDescriptor.

Also, the root type (where you start serialization or schema writing) can be nullable, but can't be optional. Replacing isNullable with isElementNullable would make it impossible to detect this without the mentioned instance check.
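
A small sketch of that argument, using a pared-down and hypothetical `MiniDescriptor` interface in place of the real SerialDescriptor: `isElementNullable` falls out as a one-line extension, while a nullable root type is still visible directly through `isNullable`.

```kotlin
// Hypothetical, pared-down descriptor with only the members discussed above.
interface MiniDescriptor {
    val name: String
    val isNullable: Boolean
    val elementsCount: Int
    fun getElementDescriptor(index: Int): MiniDescriptor
}

// The extension mentioned in the comment: derived from the two existing members.
fun MiniDescriptor.isElementNullable(index: Int): Boolean =
    getElementDescriptor(index).isNullable

class PrimitiveDescriptor(override val name: String) : MiniDescriptor {
    override val isNullable: Boolean = false
    override val elementsCount: Int = 0
    override fun getElementDescriptor(index: Int): MiniDescriptor =
        throw IndexOutOfBoundsException("$name has no elements")
}

class ClassDescriptor(
    override val name: String,
    private val elements: List<MiniDescriptor>
) : MiniDescriptor {
    override val isNullable: Boolean = false
    override val elementsCount: Int get() = elements.size
    override fun getElementDescriptor(index: Int): MiniDescriptor = elements[index]
}

// Wrapper playing the role of the nullable adaptation at the use site.
class NullableWrapper(original: MiniDescriptor) : MiniDescriptor by original {
    override val isNullable: Boolean = true
}

fun main() {
    // Describes a hypothetical class Person(val name: String, val nickname: String?)
    val person = ClassDescriptor(
        "Person",
        listOf(PrimitiveDescriptor("String"), NullableWrapper(PrimitiveDescriptor("String")))
    )
    println(person.isElementNullable(0))
    println(person.isElementNullable(1))
    // A nullable root type (Person?) needs no element index at all.
    println(NullableWrapper(person).isNullable)
}
```

This prints `false`, `true`, `true`. Going the other way, from an element-only API to root nullability, has no such one-liner.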

sandwwraith avatar Sep 21 '18 09:09 sandwwraith

Hi guys, I'd like to suggest a better name for "Serialization", for me it should be Wire.

I mean, something like kotlinx.wire.

  • Four characters long, hard to beat.
  • Doesn't imply one direction or the other. "Serialization" can mean just serializing, or both serializing and deserializing.

I took this name from the great Chronicle-Wire project.

I've written my minimal "wire" library (with similar serialization-encoding separation) and this name is a pleasure to use in application code.

otcdlink-simpleuser avatar Sep 22 '18 07:09 otcdlink-simpleuser

@sandwwraith I see the point about nullability being a property of the serializer. It requires the use of the NullableSerializer "wrapper", which is not used currently, but that would also avoid the serializer copy issue. Based on my experiments, it seems that using the wrappers provides an overall better architecture with much less duplication. In that sense, perhaps the CompositeEncoder's encodeInt(...) etc. could be defaulted to forward to the appropriate primitive serializer. Keep the methods open though, as they are a worthwhile shortcut for many formats.

pdvrieze avatar Sep 24 '18 11:09 pdvrieze

For your information, I've actually done a port of the new architecture on top of the existing code. I've used that to refactor my xml serialization library (https://github.com/pdvrieze/xmlutil). It actually works quite well and keeps things a lot more rational. The port has some smaller warts (things that cannot be determined with the old compiler plugin - missing info is missing info). One big advantage is that there is normally sufficient information to actually customize the serialization/deserialization as needed by the format.

pdvrieze avatar Sep 26 '18 08:09 pdvrieze

Aside from specific API issues, I believe that this issue is more general than simply being about serialisation. I think it can be generalised to the problem of converting a unidirectional graph into a tree (maybe these algorithms help [https://en.wikipedia.org/wiki/Minimum_spanning_tree]).

Because, in languages such as Java, Javascript, and Kotlin, objects construct a graph, not a tree. But to serialise them, we need a tree. (I can imagine there are other use cases, e.g. to display an object and its parts - or is that just a kind of serialisation to a screen!)

Because programmers don't want to think about 'by-value' or 'by-reference'. (Like we used to have to do when all we had was C++.)

But to convert an object graph into a tree, we have to do one of two things, either

  1. we have a very wide flat tree, the graph is the root, all objects are contained in it, and all properties are references.
  2. Objects themselves construct trees, and the developer has to think about containment - whether the properties are composition or reference.

The use of annotations as suggested, is kind of making the developer think about containment. But I think it is a flawed approach, or at least has significant problems.

  1. Invariably I want to use objects of a class from someone else's library, and I want to serialise them, but they have not added serialisation annotations, and I have no access to the source code. Annotations do not help here. There needs to be an alternative option to add this 'serialisation' or containment information via a secondary source, i.e. a file.
  2. Classes/objects are typically defined such that there are cycles in the property definitions. It is not an option to simply specify that the objects must not form a cycle. There has to be a way of defining whether a property is to be treated as a reference or a composition. If the user/developer does not mind about the shape of the resulting serialisation/tree (i.e. he doesn't mind which objects/properties appear as reference or composite), then the reference/composite choice can be deduced automatically: make an object composite the first time you see it, and make it a reference whenever it already exists in the tree. However, sometimes it does matter. I am not sure about others, but I certainly do tend to think about composition when I am creating data structures, at least some of the time, and often need to indicate some notion of 'ownership' or tree structure.

I would like to propose/suggest/require the following things,

  1. Provide a solution that does not totally rely on annotations, so I can serialise objects from un-annotated classes.

  2. Consider whether the shape of the tree to be serialised matters. Maybe two solutions are needed, for when the shape matters and when it does not.

  3. Think about adding language support rather than annotations, and force a developer to choose between composition and reference when defining a property.

I.e., for point three I can see different options, depending on whether or not backwards compatibility has to be retained at the cost of spoiling a new feature.

a) Introduce new keywords for containment by reference, i.e.

    class Person {
        var name: String     // composite
        ref partner: Person  // reference
    }

(would also need a corresponding val/rel or something)

One could then get the compiler to warn (or error) about by-value cycles. Though sometimes these can only be detected at runtime.

b) Have an additional keyword, either compulsory or with a default (composite/reference), maybe different for data classes and 'normal' classes.

    class Person {
        composite var name: String
        reference var partner: Person
    }

I'm not sure whether a default as composite or reference makes most sense, although reference would give better backwards compatibility, it retains the problem that use of a third party class may mean the original author has not thought about it.

dhakehurst avatar Apr 11 '19 15:04 dhakehurst

@dhakehurst It is already possible to do this. You would have to use a custom encoder. Depending on your implementation, it may need to be multi-pass. Basically, it is an encoder that delegates to another encoder. When writing an element you have 3 options:

  • Write the element as normal (only possible if you use two-pass or another way to determine referencing is not needed)
  • Write the element with added reference (format specific, it could be a container with new ref id and element)
  • Write a reference to an existing reference id (if not two-pass, this must have already been seen/written).

It may be worthwhile to create a set of standard encoders/decoders for this (and other) purposes, but they wouldn't be part of the core serialization library. (Other options would be serialization-based equals and hashCode implementations.)
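
A single-pass version of the second and third options could look like the sketch below. It is not a kotlinx.serialization encoder: `Person`, `RefWriter`, and the `obj(#id, ...)` / `ref(#id)` text format are invented for illustration, and `java.util.IdentityHashMap` makes it JVM-only. Because it is single-pass, every object gets an id, so the first option (writing an element without any reference bookkeeping) is not covered.

```kotlin
import java.util.IdentityHashMap

// Hypothetical class that can form a cycle through `partner`.
class Person(val name: String) {
    var partner: Person? = null
}

// Single-pass writer: the first time an object is seen it is written in full
// together with a fresh id; every later occurrence is written as a reference.
class RefWriter {
    // Identity-based map, so two equal-looking objects still get separate ids.
    private val ids = IdentityHashMap<Any, Int>()

    fun write(person: Person?): String {
        if (person == null) return "null"
        val seen = ids[person]
        if (seen != null) return "ref(#$seen)"
        val id = ids.size + 1
        ids[person] = id // register before descending, so cycles terminate
        return "obj(#$id, name=${person.name}, partner=${write(person.partner)})"
    }
}

fun main() {
    val ann = Person("Ann")
    val bob = Person("Bob")
    ann.partner = bob
    bob.partner = ann
    println(RefWriter().write(ann))
}
```

For the two-person cycle this prints `obj(#1, name=Ann, partner=obj(#2, name=Bob, partner=ref(#1)))`.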

pdvrieze avatar Apr 15 '19 13:04 pdvrieze

@dhakehurst I agree that modern programmers don't want to think about 'by-value' or 'by-reference' and they shouldn't think about it – thanks to the GC, which can collect cycles, and the fact that almost everything in a Java (Kotlin) program is a reference. Therefore, it would be harmful to distinguish between 'reference' and 'composite' members.

I disagree with removing annotation support, which would literally mark every class as serializable. First of all, there are requirements for a class to be serializable. Secondly, classes can encapsulate internal data/state that should not be visible to external clients, or such classes can entirely be an implementation detail. Thirdly, there is a security concern – if a malicious client can get our serialized state, then what?

For using and serializing classes from third-party libraries, there is the concept of an external serializer, which does not break encapsulation since it uses only the class's public API. Such serializers are already supported in the framework.
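
The idea, stripped of the actual library API, can be sketched like this. `Point` stands in for a third-party class that cannot be annotated, and `ExternalSerializer` with its map-based representation is a hypothetical contract chosen only to keep the example dependency-free.

```kotlin
// Stand-in for a third-party class: no annotations, source not editable.
class Point(val x: Int, val y: Int)

// Hypothetical minimal serializer contract using a map as the representation.
interface ExternalSerializer<T> {
    fun toFields(value: T): Map<String, Int>
    fun fromFields(fields: Map<String, Int>): T
}

// Lives outside Point and touches only its public constructor and properties.
object PointSerializer : ExternalSerializer<Point> {
    override fun toFields(value: Point): Map<String, Int> =
        mapOf("x" to value.x, "y" to value.y)

    override fun fromFields(fields: Map<String, Int>): Point =
        Point(fields.getValue("x"), fields.getValue("y"))
}

fun main() {
    val fields = PointSerializer.toFields(Point(3, 4))
    println(fields)
    val restored = PointSerializer.fromFields(fields)
    println("${restored.x}, ${restored.y}")
}
```

Nothing in `Point` had to change, and nothing private was read, which is the encapsulation point made above.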

Regarding the mentioned problem with circular references and building a tree from the object graph: we've intentionally left it out of scope. Popular formats like JSON do not have a standard format for such references. At some point in the future, we can probably come up with an internal Kotlin serialization format (like Java Serialization), which can support such references. I believe this is only a matter of the correct encoder/decoder, as @pdvrieze suggests.

sandwwraith avatar Apr 15 '19 15:04 sandwwraith

@dhakehurst: I agree with the goal of avoiding annotations. This seems related to schema saving: if a set of SerialDescriptors can itself be serialized and saved (and loaded for future use), then what capability does annotation enable that isn't supported by the schema?

Runtime_usage says: Because protobuf relies on serial ids of fields, called 'tags', you have to provide this information, using serial annotation @SerialId

That seems backwards - a Protobuf (or Thrift, or ASN.1, or Avro, or ... ) schema has tags; they are not unique to Protobuf serialization, and they are also useful for JSON and CBOR. Tags should be an intrinsic part of SerialDescriptor, not a format-specific annotation.

Ordered types (list) have element positions, but unordered types (map, enum) do not. fun getElementIndex(name: String): Int would return a position for ordered types, but would return a tag for unordered types. Tags are useful for things like HTTP status codes where the index isn't a position:

100 "Continue"
200 "OK"
301 "MovedPermanently"
404 "NotFound"

You might want to serialize instances of this type as the index/tag instead of the name even in JSON, and in CBOR you'd always want to serialize simple map keys as tags instead of names.
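
To illustrate the distinction between a tag and a name, here is a small sketch using the status codes listed above. `HttpStatus`, `encodeAsName`, and `encodeAsTag` are hypothetical names; this is not how SerialDescriptor or @SerialId work, only a picture of what "serialize as the tag instead of the name" would mean.

```kotlin
// Each entry carries a tag that is neither its position nor its name.
enum class HttpStatus(val tag: Int) {
    Continue(100), OK(200), MovedPermanently(301), NotFound(404);

    companion object {
        private val byTag = values().associateBy { it.tag }

        fun fromTag(tag: Int): HttpStatus =
            byTag[tag] ?: throw IllegalArgumentException("Unknown tag $tag")
    }
}

// Name-based encoding, e.g. for human-readable output.
fun encodeAsName(status: HttpStatus): String = "\"${status.name}\""

// Tag-based encoding, e.g. for compact formats.
fun encodeAsTag(status: HttpStatus): String = status.tag.toString()

fun main() {
    println(encodeAsName(HttpStatus.NotFound))
    println(encodeAsTag(HttpStatus.NotFound))
    println(HttpStatus.fromTag(301))
}
```

Note that `HttpStatus.NotFound.ordinal` is 3, which is the position, while the tag is 404; the two are independent pieces of information.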

(Note: you mentioned uni-directional graphs, but minimal spanning trees are for un-directed graphs. The equivalent algorithm for directed graphs is: https://en.wikipedia.org/wiki/Edmonds%27_algorithm.)

@sandwwraith: perhaps I'm missing something, but if a schema is autogenerated from a class, then it is serializable - if a class is not serializable then a schema cannot be autogenerated from it. Eliminating annotation decorators does not require every class to be serializable, but it might require setters that operate on SerialDescriptors, and/or the ability to delete SerialDescriptors for classes for which a schema can be autogenerated but attempts to serialize should for policy reasons throw an error.

davaya avatar May 08 '19 17:05 davaya

There are a lot of good ideas in your proposal, but I don't understand why you would want to rely on annotations on the target classes. You say that your proposal is serialization-format agnostic, but I don't see how (or I don't understand it). As others like @dhakehurst and @davaya pointed out, annotations do not really help us, here. You cannot add @Serializable annotations to existing target classes.

Knowing that a class can be serialized is format-agnostic. It just means that it does not contain any non-transient (and not marked so) field like a database connection. That's why the JVM solution of having a Serializable empty interface, rather than a @Serializable(serializer...) looks a good choice for me. Of course I'm speaking of a core Kotlin interface (maybe with a new name?). Otherwise, you are binding the object with the format, which looks counter-productive to me.

A class is typically unaware of the different formats it can be serialized to or from, so specializations should definitely belong to the serializers themselves. That's why here the JVM choice of the readObject() and writeObject() private methods convention on the target classes doesn't look fine to me. The (de)serialization context is always known, so the concrete (de)serialization operations should be part of it.

Basic data types, along with data classes, would already implement the Serializable interface. Standard containers should also do so, but I did not further investigate if/how the compiler magic would help us detect non-serializability problems at compile time, and the relationship to co/contra-variance (the idea would be to dispense us of the need to declare things like SerializableList<in T: Serializable?> and so forth).

Now for transient fields, I guess we could either have a transient keyword or a @Transient annotation. But if the fields are declared not null, there must be some strategy to populate them.

Also, since extension functions are statically resolved, we would be much better off having the read(T) and write(T) functions declared as members of the (de)serializers themselves, easing format inheritance. I think there's no need for readXXX() and writeXXX(), either. A sub-format can define potentially specialized methods that will override the generic ones.

arkanovicz avatar Dec 30 '20 09:12 arkanovicz

@arkanovicz, my two cents here.

These two seem to contradict each other:

You cannot add @Serializable annotations to existing target classes.

That's why the JVM solution of having a Serializable empty interface, rather than a @Serializable(serializer...) looks a good choice for me.

The current framework allows you to declare an external serializer for library classes which you cannot modify, but the idea of a marker interface entirely rules out attaching a serializer to such classes.

Knowing that a class can be serialized is format-agnostic. It just means that it does not contain any non-transient (and not marked so) field like a database connection.

No, it doesn't.

  • You don't need to serialize each and every POJO in your codebase, and having serializers generated would just eat your processor time for nothing.
  • It's perfectly fine to serialize classes with transient fields, as it may be just a value computed based on other properties, just cached in a transient field.

Otherwise, you are binding the object with the format, which looks counter-productive to me.

A serializer is not bound to a specific format; it works with an abstraction above all formats. So binding the serializer to the object isn't binding the object to a single format. It is more like binding it to a specific representation. And you can use contextual serialization to use different representations in different circumstances, so that's not an issue either.

specializations should definitely belong to the serializers themselves.

They do in kotlinx.serialization.

Basic data types, along with data classes, would already implement the Serializable interface.

This is where kotlinx.serialization (and compiler plugins as a whole) shine. You don't need to modify core libraries, including stdlib, to add some aspect like serialization to them. It sounds counter-productive to make primitive types implement this or that when you get to add serialization or any other similar aspect. Type classes would be another way to implement that, but we don't have those in Kotlin as of now.

Standard containers should also do so, [...] (the idea would be to dispense us of the need to declare things like SerializableList<in T: Serializable?> and so forth).

Also present in kotlinx.serialization. Generics are supported unerased thanks to typeOf, which makes this solution more powerful than the usual Java-world ones with type tokens, and standard collections have their serializers handy.
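
A quick way to see what "unerased" means here is the standard library's `typeOf` (stable since Kotlin 1.6; earlier versions need an opt-in). The helper names `describe` and `serializerNameFor` below are hypothetical, and the string result only stands in for a real serializer lookup; nullability of type arguments is ignored for brevity.

```kotlin
import kotlin.reflect.KType
import kotlin.reflect.typeOf

// Hypothetical helper: walks the full generic type, including type arguments.
fun describe(type: KType): String = when (type.classifier) {
    String::class -> "string"
    Int::class -> "int"
    List::class -> "list of " +
        describe(type.arguments.single().type ?: error("star projection"))
    else -> error("unsupported type: $type")
}

// `reified` keeps T available at the call site, so typeOf sees List<String>
// rather than a raw List.
inline fun <reified T> serializerNameFor(): String = describe(typeOf<T>())

fun main() {
    println(serializerNameFor<List<String>>())
    println(serializerNameFor<List<List<Int>>>())
}
```

With a Java-style `Class<T>` token the element type would be lost; here the nested type arguments remain available without a hand-written type-token subclass.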

r4zzz4k avatar Jan 02 '21 15:01 r4zzz4k

@r4zzz4k Thanks for those clarifications. Is this KEEP still pertinent, then?

arkanovicz avatar Jan 02 '21 16:01 arkanovicz

@arkanovicz I'm not sure, let's wait for the clarification by @elizarov.

Regardless of that, please check the quite extensive kotlinx.serialization guide to learn more about the current state of the implementation of this proposal. Any questions on that can be posted in the usual channels: Kotlin Forum, #serialization channel on Kotlin Slack (join here if you're going to visit Kotlin Slack for the first time).

r4zzz4k avatar Jan 02 '21 16:01 r4zzz4k

@arkanovicz This KEEP should match the 1.0 implementation of the kotlinx.serialization framework. Can you spot any contradictions?

sandwwraith avatar Jan 18 '21 12:01 sandwwraith