AVRO-2830: Union with aliases for java
When a union evolves over time, this change allows multiple records to be merged into one using aliases.
So a writer with the following schema:
[ {
"type" : "record",
"name" : "A",
"fields" : [ { "name" : "a", "type" : "boolean" } ]
}, {
"type" : "record",
"name" : "B",
"fields" : [ { "name" : "b", "type" : "boolean" } ]
} ]
can be read by a reader with schema:
[ {
"type" : "record",
"name" : "B",
"aliases" : [ "A" ],
"fields" : [
{ "name" : "a", "type" : "boolean", "default" : true},
{ "name" : "b", "type" : "boolean", "default" : true}
]
} ]
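A minimal sketch of the intended behaviour, written with plain Java collections rather than the Avro API. `UnionAliasSketch`, `ReaderRecord`, `match` and `read` are hypothetical names made up for illustration, and the sketch assumes all fields are booleans as in the example above; the actual resolution logic in Avro may differ in detail.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: none of these types belong to the Avro API.
final class UnionAliasSketch {

    // Hypothetical stand-in for a reader record schema with boolean fields.
    static final class ReaderRecord {
        final String name;
        final List<String> aliases;
        final Map<String, Boolean> defaults; // field name -> default value

        ReaderRecord(String name, List<String> aliases, Map<String, Boolean> defaults) {
            this.name = name;
            this.aliases = aliases;
            this.defaults = defaults;
        }
    }

    // Finds the first reader branch whose name or alias equals the writer's record name.
    static ReaderRecord match(String writerName, List<ReaderRecord> readerUnion) {
        for (ReaderRecord r : readerUnion) {
            if (r.name.equals(writerName) || r.aliases.contains(writerName)) {
                return r;
            }
        }
        return null;
    }

    // Keeps written values for fields the reader declares; fills the rest from defaults.
    static Map<String, Boolean> read(String writerName, Map<String, Boolean> written,
                                     List<ReaderRecord> readerUnion) {
        ReaderRecord reader = match(writerName, readerUnion);
        if (reader == null) {
            throw new IllegalArgumentException("no reader branch for " + writerName);
        }
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (Map.Entry<String, Boolean> field : reader.defaults.entrySet()) {
            Boolean value = written.get(field.getKey());
            result.put(field.getKey(), value != null ? value : field.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Boolean> defaults = new LinkedHashMap<>();
        defaults.put("a", true);
        defaults.put("b", true);
        List<ReaderRecord> readerUnion =
                List.of(new ReaderRecord("B", List.of("A"), defaults));

        // A record written as "A" is read through the alias on "B".
        System.out.println(read("A", Map.of("a", false), readerUnion)); // {a=false, b=true}
        // A record written as "B" matches by name.
        System.out.println(read("B", Map.of("b", false), readerUnion)); // {a=true, b=false}
    }
}
```

Both writer branches end up in the single reader record `B`, with the field that was not written taking its default.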
Make sure you have checked all steps below.
Jira
- [x] My PR addresses the following AVRO-2830 issues and references them in the PR title. For example, "AVRO-1234: My Avro PR"
Tests
- [x] My PR extends unit test org.apache.avro.TestSchemaCompatibility#testReaderWriterDecodingCompatibility
Commits
- [x] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
- Subject is separated from body by a blank line
- Subject is limited to 50 characters (not including Jira issue reference)
- Subject does not end with a period
- Subject uses the imperative mood ("add", not "adding")
- Body wraps at 72 characters
- Body explains "what" and "why", not "how"
Documentation
- [x] In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and the classes in the PR contain Javadoc that explain what it does
Hi @steven-aerts ,
Thank you for your time. I looked into the relevant parts of the specification for this issue. For unions, the rules are:
fail with two branches of the same unnamed type
succeed with two branches of the same named type, if different names
If a union has two branches of the same named type with different names, and an alias is then used to associate one name with the other, the writer's schema is rewritten with the alias from the reader's schema. That rewrite is documented as:
"Rewrite a writer's schema using the aliases from a reader's schema. This permits reading records, enums and fixed schemas whose names have changed, and records whose field names have changed. The returned schema always contains the same data elements in the same order, but with possibly different names."
After the rewrite the two branches are identical, so the result violates the union rules above.
If you want to support a solution similar to AVRO-1347, I think the specification should be redefined first. This modification ignores all union + alias checks, which is a violation of the specification.
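The concern can be sketched as follows, under the assumption that the rewrite only renames union branches. `AliasRewriteSketch`, `rewrite` and `hasDuplicateNames` are hypothetical helpers for illustration, not Avro methods.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Illustration only: a union is modelled as a list of record names.
final class AliasRewriteSketch {

    // Renames each writer branch to the reader name its alias points to, if any.
    // aliasToName maps a reader alias to the reader record name that declares it.
    static List<String> rewrite(List<String> writerBranches, Map<String, String> aliasToName) {
        List<String> rewritten = new ArrayList<>();
        for (String name : writerBranches) {
            rewritten.add(aliasToName.getOrDefault(name, name));
        }
        return rewritten;
    }

    // A union may not contain two named branches with the same name.
    static boolean hasDuplicateNames(List<String> branches) {
        return new HashSet<>(branches).size() != branches.size();
    }

    public static void main(String[] args) {
        // Writer union [A, B]; reader record B declares alias A.
        List<String> rewritten = rewrite(List.of("A", "B"), Map.of("A", "B"));
        System.out.println(rewritten);                    // [B, B]
        System.out.println(hasDuplicateNames(rewritten)); // true
    }
}
```

The rewritten writer union contains `B` twice, which is the situation the union rules forbid for an ordinary schema.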
Hi @zeshuai007 ,
If I understand you correctly, you are saying that in the example given above the reader schema and the writer schema both comply with the specification,
but the intermediate schema, where the writer schema is rewritten to match the reader schema, does not. I was not aware that the intermediate schema had to comply as well.
I also think there are examples that work today and that also generate an invalid intermediate schema.
For example, take the following writer schema: [int, float] and read it with the reader schema [float].
Then you also see that the intermediate schema has two branches of the same type, which is not allowed for a normal schema, but it does and should exist in the intermediate schema.
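A rough sketch of that resolution, assuming each writer branch is matched to the first reader branch that is identical or a permitted numeric promotion (int to long, float or double; long to float or double; float to double). `UnionPromotionSketch` and its methods are hypothetical, and the real resolver may order its matching differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustration only: primitive types are modelled as plain strings.
final class UnionPromotionSketch {

    // Numeric promotions, keyed by writer type, listing acceptable reader types.
    static final Map<String, List<String>> PROMOTIONS = Map.of(
            "int", List.of("long", "float", "double"),
            "long", List.of("float", "double"),
            "float", List.of("double"));

    // Returns the first reader branch that can read the writer type, or null if none can.
    static String resolveBranch(String writerType, List<String> readerUnion) {
        List<String> promotable = PROMOTIONS.getOrDefault(writerType, List.of());
        for (String readerType : readerUnion) {
            if (readerType.equals(writerType) || promotable.contains(readerType)) {
                return readerType;
            }
        }
        return null;
    }

    // Resolves every writer branch against the reader union, keeping writer order.
    static List<String> resolveUnion(List<String> writerUnion, List<String> readerUnion) {
        List<String> resolved = new ArrayList<>();
        for (String writerType : writerUnion) {
            resolved.add(resolveBranch(writerType, readerUnion));
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Writer [int, float] read with reader [float].
        System.out.println(resolveUnion(List.of("int", "float"), List.of("float"))); // [float, float]
    }
}
```

Both writer branches resolve to the reader's single `float` branch, so the resolved view of the writer union has two branches of the same type.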
That is why I thought we could do it this way: I saw the logic above as a precedent.
But maybe I am missing something?
Thanks,
Steven