jackson-dataformats-binary icon indicating copy to clipboard operation
jackson-dataformats-binary copied to clipboard

[avro] add Stringable class method to SchemaHelper

Open trompa opened this issue 5 years ago • 7 comments

I have a POJO with a Map<Integer, String> that cannot be serialized because Integer is not marked as Stringable.

There is already a @Stringable annotation, but that wont work for Integer wrapper.

There are 2 posible solutions:

trompa avatar Aug 08 '18 14:08 trompa

Add Integer to Stringable classes. AvroSchemaHelper.java holds a list of Stringable classes. Just add it there. As Integer complies with the needed interface to be stringable ( constructor from String, toString method ).

This is specifically the classes considered by the Apache Avro implementation to be stringable. It should stay in sync with the apache implementation for compatibility reasons. Verify how the apache implementation handles Integer map keys, as that is what would need to be followed.

baharclerode avatar Aug 17 '18 15:08 baharclerode

I just verified - Integer keys are not handled as maps in Avro-- they follow the "Maps with non-stringable keys" pattern, which are encoded as an array of records with key-value fields:

{
  "type" : "array",
  "items" : {
    "type" : "record",
    "name" : "PairIntegerString",
    "namespace" : "org.apache.avro.reflect",
    "fields" : [ {
      "name" : "key",
      "type" : "int"
    }, {
      "name" : "value",
      "type" : "string"
    } ]
  },
  "java-class" : "java.util.Map"
}

A) Maps with non-stringable keys are not supported in Jackson currently B) Even the apache implementation handles this incorrectly, because it chokes on the "java-class": "java.util.Map" when trying to deserialize.

TL;DR:

  • Treating integer keys as stringable is incompatible with how the reference implementation handles integer keys.
  • Due to bugs in deserialization, the Avro reference implementation can't handle Maps with integer keys.

baharclerode avatar Aug 17 '18 15:08 baharclerode

@baharclerode Thank you for additional investigation here.

I am bit torn here. I agree that since Stringable, as a concept, is from Apache Avro codec, it makes sense to follow similar limits. At the same time, Jackson databind itself is quite flexible and should be able to allow coercions for cases where Java POJO representation differs from Avro schema.

As to using array of pairs approach... while I can see benefits, I am bit wary of adding support mostly because this would be format-specific deviation. At the same time something similar may be necessary for protobuf module as well. So perhaps this could be the way...

Now: when using array-of-pairs approach, Avro schema is different from "straight Map" case. So maybe these approaches are not conflicting: both could possibly be supported. Map-to/from-array would require support for Schema generation, but bigger problem is probably that of how to either translate things on the fly (to look like Map, wrt JsonToken stream), or, if necessary, to expose alternate structure requirement for databind (which must be aware of alternate shape).

cowtowncoder avatar Aug 17 '18 18:08 cowtowncoder

At the same time, Jackson databind itself is quite flexible and should be able to allow coercions for cases where Java POJO representation differs from Avro schema.

Agree 100%-- If I have an existing Map<String,T> schema, and I try to serialize a Map<Integer,T> with that schema, it would be good behavior to stringify the integer; Jackson databind already does that I believe.

I think the problem here is more about generating a schema for Map<Integer,T>, which Jackson doesn't support today (Probably isn't hard, just isn't implemented), and the Apache implementation will use array-of-pairs format. IMHO we should keep schema generation strictly compatible with the Apache implementation, but make the databinder understand the array-of-pairs format when it sees it in the schema and adapt the token stream appropriately so that it can handle both cases.

baharclerode avatar Aug 17 '18 19:08 baharclerode

Adding Integer.class to ReflectData - Stringable list with the addStringable method makes avro generate the map as if Integer was an stringable key:

{
  "type" : "record",
  "name" : "POJO2",
  "fields" : [ {
    "name" : "map1",
    "type" : {
      "type" : "map",
      "values" : "string",
      "java-key-class" : "java.lang.Integer"
    }
  } ]
}

They even use it on their tests: https://github.com/apache/avro/blob/master/lang/java/avro/src/test/java/org/apache/avro/reflect/TestReflect.java#L995

That was what i was trying to mimic ... I wasn't aware of the bug on serialising the integer. It was weird, tho, that BigInteger was included and Integer wasnt (Is there any issue open in Avro tracker? cannot find it ) We just realised that it is, in fact, creating a json with non string keys: {"offers": {0: {"blah":"blah"}, 1: {"blah":"blah"}}} instead of {"offers": {"0": {"blah":"blah"}, "1": {"blah":"blah"}}}

trompa avatar Aug 21 '18 08:08 trompa

I´ve done some tests and same behaviour is seen with BigDecimal and BigInteger, I am getting number keys and no strings.

Checking the issue that included the Stringable classes set, the reason to not include Integer to this set is:

Note that I'm adding Integer.class to stringables on ReflectData in the TestReflect unit test and I did pause for a minute to think about implications of adding it as the stringableClasses set is shared for both map keys as well as straight types. Since Integer.class is already bucketed as a primitive before we check for stringableClasses in a non-map scenario, it should be fine though

As for adding Integer.class as a stringable, it would be safest to do that with something like: ReflectData data = new ReflectData().addStringable(Integer.class); That way it wouldn't alter the default instance.

https://issues.apache.org/jira/browse/AVRO-1147?focusedCommentId=13451255&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13451255

If I am not wrong, the only place where this library is checking for stringable is on the map-scenario. So it should be safe to include Integer, or, at least, allow to include it with the method i added in my PR.

trompa avatar Aug 21 '18 12:08 trompa

(Is there any issue open in Avro tracker? cannot find it )

There are a lot of bugs with the reference avro implementation, unfortunately, and I do not recommend their databinding implementation as an example of a good databinder (hence, why we're here with Jackson ;) ).

We just realised that it is, in fact, creating a json with non string keys: {"offers": {0: {"blah":"blah"}, 1: {"blah":"blah"}}}

Strictly speaking, that JSON blob is invalid and not JSON (JSON keys must be strings-- see my previous point)

It was weird, tho, that BigInteger was included and Integer wasnt

BigInteger/BigDecimal cannot fit into any avro datatype except for string because it is variable precision and larger than any of the standard numeric datatypes-- I suspect making them stringable was done out of necessity.

So it should be safe to include Integer, or, at least, allow to include it with the method i added in my PR.

If added, it should be an optional configuration property or feature of an individual mapper. It should not affect all mappers, and it should not be the default behavior.

baharclerode avatar Aug 21 '18 16:08 baharclerode