AVRO-2299: Normalize Avro Standard Canonical Schema (updated after latest rebase).
Jira
This is a new feature in the Java component to normalise an Avro schema into canonical order while retaining its reserved properties. The canonical form of a schema should follow the rules below:
- The canonical normaliser filters and orders both reserved and user-defined properties: reserved Avro properties come first, in the order given below, followed by user-defined properties. Users can also normalise schemas that use additional user-defined logical types.
- JIRA issue : https://issues.apache.org/jira/browse/AVRO-2299
- No additional dependency changes.
- The normalisation feature is for Java only.
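For context, the existing Parsing Canonical Form in released Avro is what strips these properties; a minimal sketch (the class name, the example schema, and the myProp property are mine) of the behaviour this feature is meant to address:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class ParsingFormDemo {
  public static void main(String[] args) {
    // Example record schema carrying a "doc" attribute and a custom
    // user property, both of which Parsing Canonical Form strips.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Point\",\"doc\":\"a 2-D point\","
            + "\"myProp\":\"kept-elsewhere\",\"fields\":["
            + "{\"name\":\"x\",\"type\":\"int\"},{\"name\":\"y\",\"type\":\"int\"}]}");

    // Parsing Canonical Form drops "doc" and "myProp"; the standard
    // canonical form proposed here is meant to preserve reserved
    // properties such as "doc".
    System.out.println(SchemaNormalization.toParsingForm(schema));
    // {"name":"Point","type":"record","fields":[{"name":"x","type":"int"},{"name":"y","type":"int"}]}
  }
}
```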
Tests
- New unit test included in the file lang/java/avro/src/test/java/org/apache/avro/TestSchemaNormalization.java
Documentation
- Updated spec.xml
- Updated Java documentation.
Reference discussion:
- https://github.com/apache/avro/pull/452
- https://issues.apache.org/jira/browse/AVRO-2299
Sending for review: @cutting, @tjwp, @Fokko
@rumeshkrish I stumbled across AVRO-2299 when I realised that Parsing Canonical Form strips too many properties from the schema to be useful for my use case. I'm interested in trying standard canonical form, so I pulled in these changes and tried generating toCanonicalForm output on a few schemas. Overall everything worked as expected, but I did hit a couple of problems:
- It looks like these changes introduce duplicate classes called TestStandardCanonicalSchema and TestCustomCanonicalSchema into lang/java/avro/src/test/java/org/apache/avro/TestSchemaNormalization.java. Eyeballing these duplicates, they appear to be identical to each other, so I just deleted one copy of each, and that seemed to resolve the problem.
- Generating the standard canonical form for an enum schema doesn't appear to preserve the default property as I had expected. Here's the test code I wrote:
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;
public class App {
public static void main(String[] args) throws IOException {
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(new File("./test.avro"));
System.out.println(SchemaNormalization.toCanonicalForm(schema));
}
}
And here's the contents of ./test.avro:
{
"type": "enum",
"name": "Suit",
"symbols" : ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"],
"default": "HEARTS"
}
This generates the following output, which does not include the default:
{"name":"Suit","type":"enum","symbols":["SPADES","HEARTS","DIAMONDS","CLUBS"]}
It looks like the arm of the code that handles enums is missing a check for whether the enum has a default value. By adding the following code to the end of this block, I was able to generate a default property in the output:
if (s.getEnumDefault() != null) {
o.append(",\"default\":").append(JacksonUtils.toJsonNode(s.getEnumDefault()).toString());
}
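For reference, the parsed schema does still carry the enum default via Schema.getEnumDefault() (which the fix above relies on); the writer just never emits it. A minimal check (the class name is mine), using the same enum schema as the ./test.avro example above:

```java
import org.apache.avro.Schema;

public class EnumDefaultDemo {
  public static void main(String[] args) {
    // Same enum schema as the ./test.avro example above.
    Schema suit = new Schema.Parser().parse(
        "{\"type\":\"enum\",\"name\":\"Suit\","
            + "\"symbols\":[\"SPADES\",\"HEARTS\",\"DIAMONDS\",\"CLUBS\"],"
            + "\"default\":\"HEARTS\"}");

    // The default survives parsing; only the canonical-form writer
    // fails to emit it.
    System.out.println(suit.getEnumDefault()); // HEARTS
  }
}
```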
Having played a little more, there also appears to be a discrepancy between the specification for standard canonical form and the handling of the decimal logical type. If I generate the standard canonical form for this schema:
{
"type": "bytes",
"logicalType": "decimal",
"precision": 4,
"scale": 2
}
Then I get this:
{"type":"bytes""logicalType":"decimal","precision":4,"scale":2}
The most obvious problem is that this is not valid JSON. However, if I fix this, there is still the problem that the spec does not include precision or scale in the list of preserved attributes.
In this case, I think the current implementation is correct, and the specification needs to be amended. Stripping the precision attribute results in an invalid logical type (as it is a required field), and stripping the scale attribute could change the way the schema would deserialize data: if this attribute is omitted, a default of 0 is assumed.
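The scale-defaults-to-0 behaviour can be seen from the parsed logical type itself; a short sketch (the class name is mine) using the LogicalTypes API:

```java
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class DecimalScaleDemo {
  public static void main(String[] args) {
    // A decimal schema with no "scale" attribute.
    Schema bytesDecimal = new Schema.Parser().parse(
        "{\"type\":\"bytes\",\"logicalType\":\"decimal\",\"precision\":4}");

    LogicalTypes.Decimal decimal =
        (LogicalTypes.Decimal) bytesDecimal.getLogicalType();

    // Omitting "scale" is not an error: it silently defaults to 0, so
    // stripping it from a canonical form would change how data written
    // with a non-zero scale is interpreted.
    System.out.println(decimal.getPrecision()); // 4
    System.out.println(decimal.getScale());     // 0
  }
}
```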
Experimenting with decimal logical types that are based on a fixed type also reveals some inconsistencies. Specifically, if I generate the standard canonical form for this schema:
{
"type": "fixed",
"size": 16,
"name": "x",
"logicalType": "decimal",
"precision": 4,
"scale": 2
}
Then I get this:
{"name":"x","type":"fixed","size":16}
This has discarded the logicalType attribute (which the definition of standard canonical form indicates should be preserved) as well as the other attributes of decimal which, arguably, should be present to avoid changing the interpretation of the schema.
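As a quick sanity check (a sketch of mine, not part of the PR): the dropped attributes do survive parsing as schema properties, so the canonical-form writer has everything it needs to emit them.

```java
import org.apache.avro.Schema;

public class FixedDecimalPropsDemo {
  public static void main(String[] args) {
    // The fixed-based decimal schema from the example above.
    Schema fixedDecimal = new Schema.Parser().parse(
        "{\"type\":\"fixed\",\"size\":16,\"name\":\"x\","
            + "\"logicalType\":\"decimal\",\"precision\":4,\"scale\":2}");

    // The attributes are all retained on the parsed Schema's
    // property map; only the canonical-form output discards them.
    System.out.println(fixedDecimal.getProp("logicalType"));     // decimal
    System.out.println(fixedDecimal.getObjectProp("precision")); // 4
  }
}
```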
@prestona Thanks for taking the time to do some more testing. I have updated the specification with the following test cases:
<<INPUT {"namespace":"x.y.z", "type":"enum", "name":"foo", "doc":"foo bar", "symbols":["SPADES", "HEARTS", "DIAMONDS", "CLUBS"], "default": "HEARTS"}
<<canonical {"name":"x.y.z.foo","type":"enum","symbols":["SPADES","HEARTS","DIAMONDS","CLUBS"],"doc":"foo bar","default":"HEARTS"}
// 035
<<INPUT {"type":"bytes", "logicalType":"decimal", "precision": 4, "scale": 2}
<<canonical {"type":"bytes","logicalType":"decimal","precision":4,"scale":2}
// 036
<<INPUT {"type": "fixed", "size": 16, "name": "x", "logicalType": "decimal", "precision": 4, "scale": 2}
<<canonical {"name":"x","type":"fixed","size":16,"logicalType":"decimal","precision":4,"scale":2}
@cutting @iemejia @Fokko It would be great to have this feature as part of the 1.10.0 release. Kindly review these changes; if any bug fixes or refactoring are needed, I am happy to contribute.
Hello! My apologies for the late, late reply -- I haven't been following this issue very well! I guess my big question is whether this really needs to be part of the specification and SchemaNormalization.
There are lots of reasons we might want useful, custom code that does schema transformations, including customized normalizations. Stripping custom user properties and "non-essential" metadata is certainly one of those! But is the specification necessary or useful for portability between languages or versions?
What do you think about just adding the code and tools to do "custom normalizations" into the Java SDK and others without defining a new canonical form?
@RyanSkraba It would be great if we could add standard normalization to the other language SDKs and future versions as well.