jackson-dataformats-binary

[avro] java.io.IOException: Invalid Union index (-40); union only has 2 types

Open vicenteg opened this issue 6 years ago • 5 comments

Unsure if I'm doing something wrong here. I want to deserialize Avro data to a JSON string.

I've boiled my issue down to the following:

  public static void main(String[] args) {
    String inputFile = "test.avro";
    MappingIterator<JsonNode> it = null;

    try {
      Schema jsonSchema =
          new Schema.Parser().setValidate(true).parse(new File(inputFile + ".schema"));
      AvroSchema schema = new AvroSchema(jsonSchema);

      AvroMapper avroMapper = new AvroMapper();
      avroMapper.schemaFrom(new File(inputFile + ".schema"));
      it = avroMapper.readerFor(JsonNode.class).with(schema).readValues(new FileInputStream(inputFile));
    } catch (IOException ex) {
      System.err.println("Could not open " + inputFile + " : " + ex.getMessage());
      System.exit(1);
    }

    while (it.hasNext()) {
      JsonNode row = it.next();
      System.out.println(row);
    }
  }

I get an exception:

Exception in thread "main" java.lang.RuntimeException: Invalid Union index (-40); union only has 2 types
        at com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:196)
        at test.AvroReadToJsonNode.main(AvroReadToJsonNode.java:33)
Caused by: java.io.IOException: Invalid Union index (-40); union only has 2 types
        at com.fasterxml.jackson.dataformat.avro.deser.ScalarDecoder$ScalarUnionDecoder$FR._checkIndex(ScalarDecoder.java:422)
        at com.fasterxml.jackson.dataformat.avro.deser.ScalarDecoder$ScalarUnionDecoder$FR.readValue(ScalarDecoder.java:412)
        at com.fasterxml.jackson.dataformat.avro.deser.RecordReader$Std.nextToken(RecordReader.java:134)
        at com.fasterxml.jackson.dataformat.avro.deser.AvroParserImpl.nextToken(AvroParserImpl.java:98)
        at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:249)
        at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:68)
        at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:15)
        at com.fasterxml.jackson.databind.MappingIterator.nextValue(MappingIterator.java:277)
        at com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:192)

The schema looks like this:

{
  "type" : "record",
  "name" : "test",
  "namespace" : "test.test.avro",
  "doc" : "",
  "fields" : [ {
    "name" : "some_string",
    "type" : [ "null", "string"]
  } ]
}

And I generated data from the schema using avrotools:

avrotools random --schema-file test.avro.schema --count 100 test.avro

vicenteg avatar Nov 27 '17 16:11 vicenteg

And this is with which Jackson version?

cowtowncoder avatar Nov 28 '17 00:11 cowtowncoder

2.9.2

        <dependency>
            <groupId>com.fasterxml.jackson.dataformat</groupId>
            <artifactId>jackson-dataformat-avro</artifactId>
            <version>2.9.2</version>
        </dependency>

vicenteg avatar Nov 28 '17 16:11 vicenteg

Ok. So the reproduction is almost complete, the one missing piece being the encoded input file. I think that is needed, as presumably the module would not write such content itself.

I am guessing this might be due to one unfortunate design choice by the Avro authors, however: the format is different when stored in a file compared to when encoded for transmission. If so, the file will start with a marker followed by the schema as JSON. Given the lack of any metadata in the raw encoding, this is not possible to reliably auto-detect, and it seems strange to require codecs to be aware of the input source. At the moment this module does not have special handling for this prefix, although I think there is an issue requesting an implementation.

It should be relatively easy to check whether the input might be of this form. The Avro specification outlines what the header looks like:

https://avro.apache.org/docs/1.8.2/spec.html#Object+Container+Files
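
As a rough illustration of such a check: per the linked specification, an Object Container File begins with four magic bytes, ASCII 'O', 'b', 'j' followed by the byte 1. The sketch below only peeks at those four bytes and does not parse the rest of the header (the metadata map and sync marker). The class and method names (AvroContainerCheck, looksLikeObjectContainer) are made up for this example and are not part of Jackson or Avro.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, not part of jackson-dataformat-avro or Apache Avro.
final class AvroContainerCheck {
    // Magic bytes of an Avro Object Container File: 'O', 'b', 'j', 1.
    private static final byte[] MAGIC = { 'O', 'b', 'j', 1 };

    /**
     * Returns true if the stream starts with the container magic.
     * The stream must support mark/reset (for example a BufferedInputStream);
     * it is reset to its original position before returning.
     */
    public static boolean looksLikeObjectContainer(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(MAGIC.length);
        try {
            byte[] head = new byte[MAGIC.length];
            int n = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
            return n == MAGIC.length && Arrays.equals(head, MAGIC);
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] container = { 'O', 'b', 'j', 1, 4, 22 }; // starts with the magic
        byte[] raw = { 2, 6, 'f', 'o', 'o' };           // some other byte sequence
        System.out.println(looksLikeObjectContainer(new ByteArrayInputStream(container)));
        System.out.println(looksLikeObjectContainer(new ByteArrayInputStream(raw)));
    }
}
```

For a real file, wrapping the FileInputStream in a BufferedInputStream before calling the check would keep the stream usable afterwards. Note that this is only a heuristic: raw encoded content could in principle start with the same four bytes.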

I think this is one of the badly designed parts of the specification and wonder what the authors were smoking. But it is what it is.

cowtowncoder avatar Nov 28 '17 18:11 cowtowncoder

For the encoded input file, you can use avrotools random to generate some data. I used a command line like the following:

avrotools random --schema-file test.avro.schema --count 100 test.avro

Here's a link to a sample file: https://storage.googleapis.com/vincegonzalez/jackson-dataformats-binary-issue-123.avro

vicenteg avatar Nov 29 '17 19:11 vicenteg

Yes, that does start with the Obj signature indicating an Object Container file, with the signature followed by a JSON-encoded embedded schema.

So as things are, Object Container files are not supported, only raw encoded content. Issue #8 is about adding support for handling this case (both reading and writing).
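
For what it's worth, the odd-looking index in the error message seems consistent with this diagnosis. In the raw Avro encoding a union value starts with its branch index, written as a zigzag-encoded variable-length integer. If the decoder reads the first byte of the container signature, 'O' (0x4F), as that index, zigzag decoding yields -40. The sketch below shows just that arithmetic; the class name ZigZagDemo is made up for illustration, and this is not the module's actual decoding code.

```java
// Hypothetical demo class; illustrates Avro's zigzag decoding only.
final class ZigZagDemo {
    /** Decodes a zigzag-encoded value as used by Avro for int and long. */
    public static long decodeZigZag(long n) {
        return (n >>> 1) ^ -(n & 1);
    }

    public static void main(String[] args) {
        // 'O' is 0x4F (79). Its high bit is clear, so it forms a complete
        // one-byte varint when read as raw encoded content.
        int firstByte = 'O';
        System.out.println(decodeZigZag(firstByte)); // prints -40
    }
}
```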

cowtowncoder avatar Dec 03 '17 23:12 cowtowncoder