
AVRO-2299: Normalize Avro Standard Canonical Schema updated latest rebase.


Jira

This is a new feature in the Java component to normalise an Avro schema into canonical order and produce the Avro schema with its reserved properties. The canonical form of a schema should follow the rules below:

  • The canonical normaliser should filter and order the reserved properties as well as user-provided properties. Users are able to normalise a schema with additional user-defined logical types. Reserved Avro properties are ordered first, followed by user-provided properties (a rough sketch of this ordering follows this list).
  • JIRA issue : https://issues.apache.org/jira/browse/AVRO-2299
  • No additional dependency changes.
  • This normalisation feature is for Java only.
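
For illustration, here is a rough sketch of the ordering rule described above. It is not the PR's implementation; the RESERVED_ORDER list and the normalizeProps helper are hypothetical names used only to show the idea of emitting reserved properties first, in a fixed order, followed by user-provided properties.

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PropertyOrderSketch {
    // Hypothetical reserved-property order; the authoritative list is defined by the PR.
    private static final List<String> RESERVED_ORDER = Arrays.asList(
        "name", "type", "fields", "symbols", "items", "values", "size",
        "logicalType", "order", "doc", "aliases", "default");

    // Emit reserved properties first, in the fixed order above,
    // then append any remaining user-provided properties.
    static Map<String, Object> normalizeProps(Map<String, Object> props) {
        Map<String, Object> ordered = new LinkedHashMap<>();
        for (String key : RESERVED_ORDER) {
            if (props.containsKey(key)) {
                ordered.put(key, props.get(key));
            }
        }
        for (Map.Entry<String, Object> e : props.entrySet()) {
            if (!RESERVED_ORDER.contains(e.getKey())) {
                ordered.put(e.getKey(), e.getValue()); // user-defined properties last
            }
        }
        return ordered;
    }
}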

Tests

  • New unit test included in the file lang/java/avro/src/test/java/org/apache/avro/TestSchemaNormalization.java

Documentation

  • Updated spec.xml
  • Updated Java documentation.

Reference discussion:

  • https://github.com/apache/avro/pull/452
  • https://issues.apache.org/jira/browse/AVRO-2299

Sending for review: @cutting, @tjwp, @Fokko

rumeshkrish avatar Feb 07 '20 14:02 rumeshkrish

@rumeshkrish I stumbled across AVRO-2299 when I realised that parsing canonical form strips too many properties from the schema to be useful for my use case. I'm interested in trying standard canonical form, so I pulled in these changes and tried generating toCanonicalForm output for a few schemas. Overall everything worked as expected, but I did hit a couple of problems:

  1. It looks like these changes introduce duplicate classes called TestStandardCanonicalSchema and TestCustomCanonicalSchema into lang/java/avro/src/test/java/org/apache/avro/TestSchemaNormalization.java. Eyeballing these duplicates, they appear to be identical to each other, so I just deleted one copy of each, and that seemed to resolve the problem.

  2. Generating the standard canonical form for an enum schema doesn't appear to preserve the default property as I had expected. Here's the test code I wrote:

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class App {
    public static void main(String[] args) throws IOException {
        // Parse the schema from a local file and print its canonical form.
        Schema.Parser parser = new Schema.Parser();
        Schema schema = parser.parse(new File("./test.avro"));
        System.out.println(SchemaNormalization.toCanonicalForm(schema));
    }
}

And here's the contents of ./test.avro:

{
  "type": "enum",
  "name": "Suit",
  "symbols" : ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"],
  "default": "HEARTS"
}

This generates the following output, which does not include the default:

{"name":"Suit","type":"enum","symbols":["SPADES","HEARTS","DIAMONDS","CLUBS"]}

It looks like the arm of the code that handles ENUMs is missing a check for whether the enum has a default value. By adding the following code to the end of this block, I was able to generate a default property in the output:

if (s.getEnumDefault() != null) {
    o.append(",\"default\":").append(JacksonUtils.toJsonNode(s.getEnumDefault()).toString());
}

prestona avatar Mar 03 '20 23:03 prestona

Having played a little more, there also appears to be a discrepancy between the specification for standard canonical form and the handling of the decimal logical type. If I generate the standard canonical form for this schema:

{
  "type": "bytes",
  "logicalType": "decimal",
  "precision": 4,
  "scale": 2
}

Then I get this:

{"type":"bytes""logicalType":"decimal","precision":4,"scale":2}

The most obvious problem is that this is not valid JSON. However, if I fix this then there is still the problem that the spec. does not include precision or scale in the list of preserved attributes.

In this case, I think the current implementation is correct and the specification needs to be amended. This is because stripping the precision attribute results in an invalid logical type (as it is a required field). Stripping the scale attribute could change the way the schema would deserialize data, because a default of 0 is assumed when this attribute is omitted.
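
To illustrate why scale matters, here is a small sketch (using the standard org.apache.avro.LogicalTypes API; the ScaleDefaultCheck class name is just for this example) that parses the same bytes/decimal schema with and without the scale attribute and prints the scale Avro assumes:

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class ScaleDefaultCheck {
    public static void main(String[] args) {
        // The same decimal schema, once with an explicit scale and once without it.
        String withScale =
            "{\"type\":\"bytes\",\"logicalType\":\"decimal\",\"precision\":4,\"scale\":2}";
        String withoutScale =
            "{\"type\":\"bytes\",\"logicalType\":\"decimal\",\"precision\":4}";

        Schema a = new Schema.Parser().parse(withScale);
        Schema b = new Schema.Parser().parse(withoutScale);

        // Avro assumes scale 0 when the attribute is omitted, so the two
        // schemas interpret the same encoded bytes differently.
        System.out.println(((LogicalTypes.Decimal) LogicalTypes.fromSchema(a)).getScale()); // 2
        System.out.println(((LogicalTypes.Decimal) LogicalTypes.fromSchema(b)).getScale()); // 0
    }
}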

Experimenting with decimal logical types that are based on a fixed type also has some inconsistencies. Specifically, if I generate the standard canonical form for this schema:

{
  "type": "fixed",
  "size": 16,
  "name": "x",
  "logicalType": "decimal",
  "precision": 4,
  "scale": 2
}

Then I get this:

{"name":"x","type":"fixed","size":16}

This has discarded the logicalType attribute (which the definition of standard canonical form indicates should be preserved) as well as the other attributes of decimal which, arguably, should be present to avoid changing the interpretation of the schema.
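
A quick way to spot this kind of regression is to reparse the canonical output and check whether the decimal logical type survived. This is a sketch only, assuming the toCanonicalForm method proposed in this PR; with the current behaviour it would print "null" because the logicalType attribute is dropped:

import org.apache.avro.LogicalType;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class LogicalTypeRoundTrip {
    public static void main(String[] args) {
        String input = "{\"type\":\"fixed\",\"size\":16,\"name\":\"x\","
            + "\"logicalType\":\"decimal\",\"precision\":4,\"scale\":2}";

        Schema original = new Schema.Parser().parse(input);
        String canonical = SchemaNormalization.toCanonicalForm(original); // proposed in this PR
        Schema reparsed = new Schema.Parser().parse(canonical);

        // Returns null if the reparsed schema no longer carries a valid logical type.
        LogicalType lt = LogicalTypes.fromSchemaIgnoreInvalid(reparsed);
        System.out.println(lt == null ? "null" : lt.getName());
    }
}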

prestona avatar Mar 04 '20 00:03 prestona

@prestona Thanks for taking the time to do some more testing. I have updated the specification with the following test cases.

<<INPUT {"namespace":"x.y.z", "type":"enum", "name":"foo", "doc":"foo bar", "symbols":["SPADES", "HEARTS", "DIAMONDS", "CLUBS"], "default": "HEARTS"}
<<canonical {"name":"x.y.z.foo","type":"enum","symbols":["SPADES","HEARTS","DIAMONDS","CLUBS"],"doc":"foo bar","default":"HEARTS"}

// 035
<<INPUT {"type":"bytes", "logicalType":"decimal", "precision": 4, "scale": 2}
<<canonical {"type":"bytes","logicalType":"decimal","precision":4,"scale":2}

// 036
<<INPUT {"type": "fixed", "size": 16, "name": "x", "logicalType": "decimal", "precision": 4, "scale": 2}
<<canonical {"name":"x","type":"fixed","size":16,"logicalType":"decimal","precision":4,"scale":2}
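
Outside the spec test harness, case 035 can also be exercised with a plain JUnit assertion. This is only a sketch: it assumes the toCanonicalForm method added by this PR, and the class and method names are illustrative.

import static org.junit.Assert.assertEquals;

import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;
import org.junit.Test;

public class CanonicalFormExamples {
    @Test
    public void bytesDecimalKeepsPrecisionAndScale() {
        // Test case 035 from the spec update above.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"bytes\",\"logicalType\":\"decimal\",\"precision\":4,\"scale\":2}");
        assertEquals(
            "{\"type\":\"bytes\",\"logicalType\":\"decimal\",\"precision\":4,\"scale\":2}",
            SchemaNormalization.toCanonicalForm(schema));
    }
}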

rumeshkrish avatar Jun 14 '20 09:06 rumeshkrish

@cutting @iemejia @Fokko It would be great to have this feature as part of the 1.10.0 release. Kindly review these changes; if any support is needed for fixing bugs or refactoring, I am happy to contribute.

rumeshkrish avatar Jun 14 '20 09:06 rumeshkrish

Hello! My apologies for the late, late reply -- I haven't been following this issue very well! I guess my big question is whether this really needs to be part of the specification and SchemaNormalization.

There are lots of reasons we might want useful, custom code that does schema transformations, including customized normalizations. Stripping custom user properties and "non-essential" metadata is certainly one of those! But is the specification necessary or useful for portability between languages or versions?

What do you think about just adding the code and tools to do "custom normalizations" into the Java SDK and others without defining a new canonical form?
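
For the sake of discussion, a "custom normalization" entry point might look roughly like the interface below. The CustomNormalizer name and the keepProperties parameter are hypothetical and only illustrate the shape such an API could take in the Java SDK; nothing like this exists in the PR today.

import java.util.Set;
import org.apache.avro.Schema;

// Hypothetical API sketch only -- not part of the Avro SDK or this PR.
public interface CustomNormalizer {

    // Produce a normalized JSON form of the schema, keeping only the named
    // user properties in addition to the core schema attributes.
    String normalize(Schema schema, Set<String> keepProperties);
}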

RyanSkraba avatar Jun 15 '20 10:06 RyanSkraba

@RyanSkraba It would be great if we could add Standard Normalization to other languages or versions as well.

rumeshkrish avatar Jun 15 '20 15:06 rumeshkrish