Allow reading old Parquet files written through the old API with an old Avro version that accepted invalid default values in the schema

Open wwang-talend opened this issue 2 years ago • 1 comments

Consider a user who calls this API to write a Parquet file, which stores the Avro schema string in the file metadata:

ParquetWriter<GenericRecord> writer = AvroParquetWriter
                    .<GenericRecord> builder(tempFile)
                    .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
                    .withSchema(avroSchema)
                    .build();

Here "avroSchema" contains an invalid default value. That was possible because, before Avro 1.9, the Avro API did not validate default values when constructing a schema.

Now the user switches to a newer parquet-avro jar with Avro 1.11.x and reads that old Parquet file. AvroReadSupport parses the Avro schema string stored in the file metadata, Avro validates the default value during parsing, and the read fails. That is the reason for this change.

The exception stack trace from reading the file is:

(org.apache.avro.AvroTypeException) Invalid default for field id: "" not a ["int","null"]
	at org.apache.avro.Schema.validateDefault(Schema.java:1636) ~[avro-1.11.2.jar:1.11.2]
	at org.apache.avro.Schema.access$500(Schema.java:94) ~[avro-1.11.2.jar:1.11.2]
	at org.apache.avro.Schema$Field.<init>(Schema.java:561) ~[avro-1.11.2.jar:1.11.2]
	at org.apache.avro.Schema.parse(Schema.java:1747) ~[avro-1.11.2.jar:1.11.2]
	at org.apache.avro.Schema$Parser.parse(Schema.java:1472) ~[avro-1.11.2.jar:1.11.2]
	at org.apache.avro.Schema$Parser.parse(Schema.java:1459) ~[avro-1.11.2.jar:1.11.2]
	at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:124) ~[parquet-avro-1.10.1.jar:1.10.1]
	at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183) ~[parquet-hadoop-1.10.1.jar:1.10.1]
	at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156) ~[parquet-hadoop-1.10.1.jar:1.10.1]
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135) ~[parquet-hadoop-1.10.1.jar:1.10.1]
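
To make the failure mode concrete, here is a minimal, dependency-free sketch of the behavior described above. It is not Avro's code: the class DefaultValidationSketch, the Field type, and the createField helper are all hypothetical, and the check is a simplified model of the specification's rule that a union field's default must match the first branch of the union (the exact rule may differ between Avro versions). The validateDefaults flag stands in for the difference between the old lenient behavior and the strict behavior seen with Avro 1.11.x.

```java
import java.util.Arrays;
import java.util.List;

public class DefaultValidationSketch {

    /** Hypothetical stand-in for a record field: name, union branches, default value. */
    public static final class Field {
        public final String name;
        public final List<String> unionBranches;
        public final Object defaultValue;

        Field(String name, List<String> unionBranches, Object defaultValue) {
            this.name = name;
            this.unionBranches = unionBranches;
            this.defaultValue = defaultValue;
        }
    }

    /** Returns true if the value fits the given primitive type name (simplified model). */
    static boolean matches(String type, Object value) {
        switch (type) {
            case "null":   return value == null;
            case "int":    return value instanceof Integer;
            case "string": return value instanceof String;
            default:       return false;
        }
    }

    /** Simplified rule: the default must match the FIRST branch of the union. */
    public static boolean isValidDefault(List<String> unionBranches, Object defaultValue) {
        return matches(unionBranches.get(0), defaultValue);
    }

    /** Builds a field, rejecting an invalid default only when validation is enabled. */
    public static Field createField(String name, List<String> unionBranches,
                                    Object defaultValue, boolean validateDefaults) {
        if (validateDefaults && !isValidDefault(unionBranches, defaultValue)) {
            throw new IllegalArgumentException(
                "Invalid default for field " + name + ": " + defaultValue
                + " not a " + unionBranches);
        }
        return new Field(name, unionBranches, defaultValue);
    }

    public static void main(String[] args) {
        List<String> union = Arrays.asList("int", "null");

        // Lenient mode (models the old write path): the bad default "" is accepted.
        Field written = createField("id", union, "", false);
        System.out.println("lenient mode accepted field: " + written.name);

        // Strict mode (models the new read path): the same schema is rejected.
        try {
            createField("id", union, "", true);
        } catch (IllegalArgumentException e) {
            System.out.println("strict mode rejected field: " + e.getMessage());
        }
    }
}
```

In this model the same field definition succeeds when validation is off and fails when it is on, which mirrors why a file written earlier can no longer be read. For reference, Avro's Schema.Parser exposes setValidateDefaults(boolean); relaxing it on the read path is one possible approach, though this thread does not confirm that this is what the change does.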

wwang-talend, Sep 14 '23 06:09

Thanks @wwang-talend for opening the PR! Could you create a JIRA issue for this?

wgtmac, Sep 20 '23 01:09