ion-cli Anonymous types must have locally-defined names in order to be stable

Suppose I have the following schema:

$ion_schema_2_0

type::{
  name: Foo,
  fields: {
    a: { fields: { x: int, y: int } },
    b: int,
  }
}

type::{
  name: Bar,
  type: list,
  element: { fields: { u: float, v: float } }
}

As it is currently implemented, the code generator might generate something like this. (I'm using Kotlin, rather than Java, for conciseness here, but the principle will still apply.)

class Foo(val a: AnonymousType1, val b: Int)

class AnonymousType1(val x: Int, val y: Int)

class Bar(val value: List<AnonymousType2>)

class AnonymousType2(val u: Double, val v: Double)

Now, suppose I add a new field to Foo, like this:

type::{
  name: Foo,
  fields: {
    a: { fields: { x: int, y: int } },
    b: int,
    c: { fields: { id: string, value: int },
  }
}

The code generator has no way of knowing how to keep the generated code stable, and would likely end up generating these classes:

class Foo(val a: AnonymousType1, val b: Int, val c: AnonymousType2)

class AnonymousType1(val x: Int, val y: Int)
class AnonymousType2(val id: String, val value: Int)

class Bar(val value: List<AnonymousType3>)

class AnonymousType3(val u: Double, val v: Double)

In this scenario, AnonymousType2 and Bar have been redefined in a way that will break the code of anyone who is using either one of these classes in any non-trivial way. This cannot be solved by sorting the anonymous types because the only sort attribute that would not result in types being moved/redefined would be the timestamp that the inline type was added to the schema for the first time.

There are a few things that I think we must do in order to fix this problem, but I'm certainly open to hearing other solutions.

All inline types must be placed inside a namespace corresponding to the parent type. In Java, that would mean e.g. org.example.AnonymousType1 would instead be org.example.Foo.AnonymousType1. In Rust, it's a little trickier because you could end up with naming clashes where you create a module called foo to hold children of Foo when there's already another module you're generating called foo, but this is solvable.
Inline types, if they have a $code_gen_name field, should use that name; otherwise...

Inline types in a fields constraint should be named after their respective fields. I.e.AnonymousType1 would be named Foo.A in Java/Kotlin.
Inline types in an element constraint should be named Element. I.e. AnonymousType2 would be named Bar.Element in Java/Kotlin.
Inline types in type or all_of should have their constraints flattened into the containing type.
Inline types in one_of constraint have two possible solutions
- They should either be required to have a code-gen name (so that we can add a variant-specific annotation) OR
- They should be named Variant0..VariantN, and every one_of constraint must have its own distinct counter (i.e. not a globally incremented counter). E.g. Car.Variant0, Car.Variant1, etc.
Inline types in ordered_elements should be named Element0..ElementN, and every ordered_elements constraint must have its own distinct counter.

I might be missing a case somewhere, but I think that mostly covers them all.

Apr 18 '24 23:04 popematt

As per an offline discussion, all the inline type definitions must have a name defined for the initial implementation. This simplifies later adding support to infer names based on above suggested option 2(e.g. field names). Also, as part of a 1.0 release of code generation we must add a $code_gen_name field to allow user to specify a name for inline types.

May 09 '24 23:05 desaikd

There can be few alternatives to $codegen_name in order to extend it in future to hold more details about the inline type like namespace, different names in different programming languages etc. Writing it out here for consideration,

Option 1: Use an Ion struct

We can use an Ion struct here which can hold more information about the name of inline type, like different names in different programming languages, a namespace to consider for when there is a collision of same inline names possible etc. e.g.

type:: {
  name: uses_inline_type,
  type: { 
       type: int, 
       $isl_codegen: {    
          java: {
              name: FooType,
          },
          rust: {
             name: Foo,
          }
          ...
      }
  }
}

Option 2: Use annotations

We can use annotations for a compact syntax then the struct in Option 1. e.g.

type:: {
  name: uses_inline_type,
  type: { 
       type: int, 
        $isl_codegen: [
          java::name::FooType,
          rust::name::Foo,
        ],
        ...
  }
}

If we do decide to stick with $codegen_name then we can add more constraints like $codegen_name in future and on the implementation side require all/some of them being present based on usage. e.g.

type:: {
  name: uses_inline_type,
  type: { 
       type: int, 
       $codegen_name_java: FooType,
       $codegen_name_rust: Foo,
       ...
  }
}

cc: @zslayton

May 16 '24 20:05 desaikd

I did some thinking about this a while ago, here is what I had come up with...

In theory, all code gen properties are one of (click on them for examples):

Language-agnostic fields that just accept the data as the field value without any language specific cases

type::{
  name: foo,
  $codegen: { ignore: true }
}

Language-specific fields that accept a struct with language-value pairs

type::{
  name: special_type,
  $codegen: {
    native_type: {
      rust: "my_crate::types::MyType",
      kotlin: "org.example.MyType",
    }
}

Language-optional fields that can go either way and accept a language agnostic value and/or language-value pairs.

type::{
  name: foo,
  $codegen: { name: foo_object }
}

type::{
  name: bar,
  $codegen: { 
    name: {
      kotlin: bar_class,
      rust: bar_struct,
      '*': bar_type, // Optional default value for other languages
    }
  }
}

It’s a backwards compatible change for a codegen config property to go from language-agnostic to language-optional or to go from language-specific to language-optional, but it’s breaking to change a field from language-optional to anything else. Therefore, we should try to make fields language-agnostic or language-specific until we have a compelling case to make it language-optional.

This pattern is especially useful because it's easy for a codegen tool to find the right values. First, find the $codegen field. Then, if you need to access a specific value, such as name, you get the name field. If the value is not a struct, then you've got the value. If it is a struct, you use the language name as the field to get the actual value.

Whatever we do, the configuration hierarchy should be codegen signifier -> property name -> language specifier so that we can avoid repetition for language-agnostic properties, and so that we can make it possible to have language specific values and a fallback value.

We could choose from many different ways to structure the keys. One alternative to the above proposal is a flatter set of keys, like this:

type::{
  name: foo,
  $codegen_name: { java: FooClass, rust: FooStruct },
  $codegen_type_hint: tuple,
}

^^ That was from some of my notes from nearly 18 months ago.

I don't think there's too much difference between different structures, but I would suggest that we don't rely on constructing field names to form something like $codegen_name_java. Having to build strings in order to create the right field names at runtime seems like a bit of an anti-pattern, and if we e.g. want to find all of the java related fields, we need to see all the fields and check for partial matches that include "java". Let's make the IonReader do that parsing work for us.

I would also suggest that we don't use a list of annotated values. I'd rather we be opinionated about a structure for the property values and force values that are similar in some way to be grouped together. I also think that the double annotations are going to be unwieldy to work with and unintuitive for users.

Single annotations could be okay. (Click to expand examples)

I don't love it, but I'm also not opposed to something like either of these:

type::{
  name: foo,
  $codegen_name: [
    java::kotlin::FooClass, // Applies to both Java and Kotlin
    rust::FooStruct,
    FooType,         // Language not specified, so this is the default value for other languages
  ]
  $codegen_type_hint: tuple,
}

type::{
  name: foo,
  $codegen: {
    name: [
      java::FooClass, 
      rust::FooStruct,
    ],
    type_hint: tuple,
  }
}

May 16 '24 22:05 popematt

Thanks for the detailed explanation Matthew! I think from your examples, language-specific sounds like the best solution to start with and infact having that specific syntax rather then the flatten one will make it much easier for user to specify everything in one place/one struct, rather then having different properties(e.g. $codegen_name, $codegen_type_hint). I also realized from your examples that this $codegen struct can be used for all inline types as well as top level types(e.g. to use something like type_hint which is useful for both top level and inline types). So following example from your ideas on language-specific syntax, sounds like the best approach to me:

type::{
  name: foo,
  $codegen: {
    name: {
      java: "FooClass",
      rust: "FooStruct",
    }
    type_hint: {
       java: class,
       rust: struct
    }
}

May 16 '24 22:05 desaikd