AVRO-4135: [C] full type name for json encoded unions
## What is the purpose of the change
Like Java, use the full type name when encoding union branches.
This prevents ambiguities when decoding such a JSON-encoded field back to Avro with, for example, the Java library.
Without this patch you cannot use the Java JsonDecoder to decode a union whose record branches have a namespace entry, as generated by the Avro C library.
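For illustration (using the example schema from the discussion below, where the union's record branch `r2` lives in namespace `space`), the two encodings are:

```
C avro_value_to_json (before this patch): {"field": {"r2": {}}}
Java JsonEncoder (and C with this patch): {"field": {"space.r2": {}}}
```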
## Verifying this change
This change adds tests and can be verified by running `test_avro_4135`.
## Documentation
- Does this pull request introduce a new feature? no
Can you show an example of the ambiguous JSON data and the Avro schema with which it was encoded?
https://avro.apache.org/docs/1.12.0/specification/#json-encoding stipulates:
> otherwise it is encoded as a JSON object with one name/value pair whose name is the type’s name and whose value is the recursively encoded value. For Avro’s named types (record, fixed or enum) the user-specified name is used, for other types the type name is used.
That specifically says "name" rather than "fullname". If the JSON encoders should now output the fullname of the type as the property name, then the specification should be changed.
This likewise says "name" rather than "fullname":
> Unions may not contain more than one schema with the same type, except for the named types record, fixed and enum. For example, unions containing two array types or two map types are not permitted, but two types with different names are permitted. (Names permit efficient resolution when reading and writing unions.)
Moreover, the schema resolution part of the specification says that schemas match if:
> both schemas are records with the same (unqualified) name
Suppose the specification were changed to allow multiple identically named record schemas as members of the same union, as long as they are in different namespaces. If such a union schema is the reader's schema, both record schemas can match the writer's schema, in which case the first matching schema must be used in schema resolution, even if the other record schema has the same namespace as the writer's schema. That seems an undesirable result to me.
@KalleOlaviNiemitalo the ambiguity you describe is contained in AVRO-2287.
What we are seeing is that the C library follows the "name" interpretation while the Java library follows the "fullname" interpretation.
To reproduce this with an example, I used the schema at the bottom of this message.
To encode:
- `avro_value_to_json` will output `{"field": {"r2": {}}}`.
- The Java `JsonEncoder` outputs `{"field": {"space.r2": {}}}`.
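For completeness, here is a minimal self-contained Java sketch (the class name and structure are mine, purely illustrative) that reproduces the `JsonEncoder` output above with the generic API:

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;

public class JsonUnionEncoding {
  public static void main(String[] args) throws Exception {
    // The example schema from the bottom of this message, inlined.
    String schemaJson = "{\"type\": \"record\", \"name\": \"r\", \"fields\": ["
        + "{\"name\": \"field\", \"type\": [\"null\", {\"type\": \"record\","
        + " \"name\": \"r2\", \"namespace\": \"space\", \"fields\": []}]}]}";
    Schema schema = new Schema.Parser().parse(schemaJson);

    // Pick the record branch (index 1) of the union and build an empty r2.
    Schema r2 = schema.getField("field").schema().getTypes().get(1);
    GenericRecord record = new GenericData.Record(schema);
    record.put("field", new GenericData.Record(r2));

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, out);
    new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
    encoder.flush();
    System.out.println(out); // {"field":{"space.r2":{}}}
  }
}
```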
On the decoding side:
C does not have an implementation for decoding the JSON encoding.
However, when you try to read `{"field": {"r2": {}}}` with the Java `JsonDecoder` for this schema, you get:
```
Exception in thread "main" org.apache.avro.AvroTypeException: Unknown union branch r2
    at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:434)
    at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:282)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:188)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
    at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:260)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:248)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
```
This shows the implementation difference between the two languages: Java clearly expects a fully qualified name for union branches.
This patch aligns the C behavior with that of Java.
Would you prefer the other way around? Are you willing to accept a patch on the Java side which allows reading the C encoding?
Example schema:
```json
{
  "type": "record",
  "name": "r",
  "fields": [
    {
      "name": "field",
      "type": [
        "null",
        {
          "type": "record",
          "name": "r2",
          "namespace": "space",
          "fields": []
        }
      ]
    }
  ]
}
```
Java code used to generate the exception:
```java
var decoder = DecoderFactory.get().jsonDecoder(schema, "{\"field\": {\"r2\": {}}}");
var data = new GenericDatumReader(schema).read(null, decoder);
```
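For contrast, a minimal sketch continuing the snippet above (same `schema` variable): decoding the fully qualified form succeeds, since Java expects the fullname as the branch key.

```java
// Same schema as above; the fully qualified branch name decodes without error.
var decoder2 = DecoderFactory.get().jsonDecoder(schema, "{\"field\": {\"space.r2\": {}}}");
var data2 = new GenericDatumReader(schema).read(null, decoder2);
```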
@KalleOlaviNiemitalo meanwhile I checked how all the other language bindings interpret the spec.
All the language bindings that have a JSON encoder fully qualify their types; only the C binding deviates here. So this patch aligns the C behavior with the other language bindings.
| binding | JSON union encoding | JSON union decoding |
|---|---|---|
| C | unqualified | N/A |
| C++ | qualified | qualified |
| csharp | qualified [1] | qualified |
| java | qualified | qualified |
| js | qualified | qualified and unqualified [2] |
| perl | N/A | N/A |
| PHP | N/A | N/A |
| python | N/A | N/A |
| ruby | N/A | N/A |
| rust | N/A? | N/A? |
[1]: It is possible to configure a non-standard encoding where the union value is not wrapped in an object. This variant is not decodable.
[2]: The unqualified path is commented in the code as being less strict. I think this was done because encoding any JavaScript object to (binary) Avro is part of the standard flow, giving developers who encode unions a little less to type.
Again, if you think the decoders should be more lenient, like the js decoder, I do not mind working out a patch for the Java bindings.
@KalleOlaviNiemitalo in #3374 I propose adding support for unqualified type resolution in the Java JsonDecoder.