ORC-1700: Write parquet decimal type data in Benchmark using `FIXED_LEN_BYTE_ARRAY` type
What changes were proposed in this pull request?
This PR aims to write Parquet decimal type data in the benchmark using the FIXED_LEN_BYTE_ARRAY type.
Why are the changes needed?
Currently, the decimal columns in the generated Parquet files use the BINARY physical type, which Spark 3.5.1 cannot read.
Spark 3.5.1 can read them if the FIXED_LEN_BYTE_ARRAY physical type is used instead.
main:
optional binary fare_amount (DECIMAL(8,2));

This PR:
optional fixed_len_byte_array(5) fare_amount (DECIMAL(10,2));
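For context, here is a minimal sketch (not code from this PR; it assumes the parquet-mr Types builder, and the field/message names are illustrative) of declaring a decimal column with the FIXED_LEN_BYTE_ARRAY physical type, sizing the array from the precision:

```java
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class DecimalSchemaSketch {
  // Smallest byte length whose signed two's-complement range covers 10^precision.
  static int minBytesForPrecision(int precision) {
    int bytes = 1;
    while (Math.pow(2.0, 8 * bytes - 1) < Math.pow(10.0, precision)) {
      bytes++;
    }
    return bytes;
  }

  public static void main(String[] args) {
    int precision = 10;
    int scale = 2;
    // DECIMAL(10,2) needs 5 bytes, matching fixed_len_byte_array(5) above.
    MessageType schema = Types.buildMessage()
        .optional(PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY)
        .length(minBytesForPrecision(precision))
        .as(LogicalTypeAnnotation.decimalType(scale, precision))
        .named("fare_amount")
        .named("taxi");
    System.out.println(schema);
  }
}
```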
java -jar spark/target/orc-benchmarks-spark-2.1.0-SNAPSHOT.jar spark data -format=parquet -compress zstd -data taxi
org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException: column: [fare_amount], physicalType: BINARY, logicalType: decimal(8,2)
at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.constructConvertNotSupportedException(ParquetVectorUpdaterFactory.java:1136)
at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.getUpdater(ParquetVectorUpdaterFactory.java:199)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:175)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:342)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:233)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.orc.bench.spark.SparkBenchmark.processReader(SparkBenchmark.java:170)
at org.apache.orc.bench.spark.SparkBenchmark.fullRead(SparkBenchmark.java:216)
at org.apache.orc.bench.spark.jmh_generated.SparkBenchmark_fullRead_jmhTest.fullRead_avgt_jmhStub(SparkBenchmark_fullRead_jmhTest.java:219)
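For reference, the same error can be hit outside the benchmark harness with a plain Spark 3.5.1 read. This is only a sketch; the data path is illustrative, not the benchmark's actual layout:

```java
import org.apache.spark.sql.SparkSession;

public class ReadTaxiSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("read-taxi-decimal")
        .master("local[*]")
        .getOrCreate();

    // Fails with SchemaColumnConvertNotSupportedException when fare_amount is
    // stored as a BINARY-backed DECIMAL(8,2) in the generated file.
    spark.read()
        .parquet("data/generated/taxi/parquet.zstd")  // illustrative path
        .selectExpr("sum(fare_amount)")
        .show();

    spark.stop();
  }
}
```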
How was this patch tested?
local test
Was this patch authored or co-authored using generative AI tooling?
No
BINARY type promotion is supported in Spark 4.0.0, and the test using a 4.0.0 SNAPSHOT works in #1909.
[SPARK-40876][SQL] Widening type promotions in Parquet readers
Should we use INT32 and INT64 for decimals where applicable?
Yes, Spark does this by default. It also provides the option spark.sql.parquet.writeLegacyFormat=true to align with how Hive writes decimals.
https://github.com/apache/spark/blob/8b8ea60bd4f22ea5763a77bac2d51f25d2479be9/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L328-L339
writeLegacyParquetFormat match {
// Standard mode, 1 <= precision <= 9, writes as INT32
case false if precision <= Decimal.MAX_INT_DIGITS => int32Writer
// Standard mode, 10 <= precision <= 18, writes as INT64
case false if precision <= Decimal.MAX_LONG_DIGITS => int64Writer
// Legacy mode, 1 <= precision <= 18, writes as FIXED_LEN_BYTE_ARRAY
case true if precision <= Decimal.MAX_LONG_DIGITS => binaryWriterUsingUnscaledLong
// Either standard or legacy mode, 19 <= precision <= 38, writes as FIXED_LEN_BYTE_ARRAY
case _ => binaryWriterUsingUnscaledBytes
}
https://github.com/apache/hive/blob/4614ce72a7f366674d89a3a78f687e419400cb89/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java#L568-L578
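A hedged usage sketch (illustrative, not from this PR) of forcing the legacy layout from the Spark side, assuming the standard DataFrame writer API; the output path and query are placeholders:

```java
import org.apache.spark.sql.SparkSession;

public class LegacyDecimalWriteSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("legacy-decimal-write")
        .master("local[*]")
        // Write decimals with precision <= 18 as FIXED_LEN_BYTE_ARRAY instead of
        // INT32/INT64, matching Hive's layout.
        .config("spark.sql.parquet.writeLegacyFormat", "true")
        .getOrCreate();

    spark.sql("SELECT CAST(12.34 AS DECIMAL(10,2)) AS fare_amount")
        .write()
        .mode("overwrite")
        .parquet("/tmp/legacy_decimal");  // illustrative output path

    spark.stop();
  }
}
```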
Could you please check if an email sent from [email protected] was accidentally moved to the spam folder? @cxzl25
Wow, I did miss this email. Thank you @wgtmac so much for inviting me, thank you @dongjoon-hyun so much for commenting and merging multiple times, and thank you to the entire ORC community!
Sorry for the long delay. Thank you, @cxzl25 and @wgtmac . :)
Merged to main/2.0.