parquet-java
Allow reading old Parquet files written by the old API with an old Avro version that allowed invalid default values in the schema
Please consider the case where a user calls this API to write a Parquet file with the Avro schema string stored in the file metadata:
ParquetWriter<GenericRecord> writer = AvroParquetWriter
    .<GenericRecord>builder(tempFile)
    .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
    .withSchema(avroSchema)
    .build();
that "avroSchema" have wrong default value(as before avro 1.9, avro api not validate default value when construct avro schema).
Now the user reads that old Parquet file with a new parquet-avro jar built against Avro 1.11.x. When AvroReadSupport parses the Avro schema string from the file metadata, the default values are validated and the read fails. That is why this change is proposed here.
The exception stack trace is:
(org.apache.avro.AvroTypeException) Invalid default for field id: "" not a ["int","null"]
at org.apache.avro.Schema.validateDefault(Schema.java:1636) ~[avro-1.11.2.jar:1.11.2]
at org.apache.avro.Schema.access$500(Schema.java:94) ~[avro-1.11.2.jar:1.11.2]
at org.apache.avro.Schema$Field.<init>(Schema.java:561) ~[avro-1.11.2.jar:1.11.2]
at org.apache.avro.Schema.parse(Schema.java:1747) ~[avro-1.11.2.jar:1.11.2]
at org.apache.avro.Schema$Parser.parse(Schema.java:1472) ~[avro-1.11.2.jar:1.11.2]
at org.apache.avro.Schema$Parser.parse(Schema.java:1459) ~[avro-1.11.2.jar:1.11.2]
at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:124) ~[parquet-avro-1.10.1.jar:1.10.1]
at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183) ~[parquet-hadoop-1.10.1.jar:1.10.1]
at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156) ~[parquet-hadoop-1.10.1.jar:1.10.1]
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135) ~[parquet-hadoop-1.10.1.jar:1.10.1]
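For context, one way to keep such files readable is to relax default-value validation when parsing the schema stored in the file metadata. The sketch below illustrates that idea using Avro's Schema.Parser.setValidateDefaults; it is only an illustration of the approach, not the exact patch in the PR:

import org.apache.avro.Schema;

// Sketch only: parse the schema string taken from the Parquet file metadata
// without validating defaults, so schemas written with pre-1.9 Avro still load.
// A reader-side change in AvroReadSupport could follow this general shape.
static Schema parseStoredSchema(String schemaString) {
  Schema.Parser parser = new Schema.Parser();
  parser.setValidateDefaults(false); // accept historically invalid defaults
  return parser.parse(schemaString);
}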
Thanks @wwang-talend for opening the PR! Could you create a JIRA issue for this?