ORC-1700: Write parquet decimal type data in Benchmark using `FIXED_LEN_BYTE_ARRAY` type
What changes were proposed in this pull request?
This PR aims to write Parquet decimal type data in the benchmark using the FIXED_LEN_BYTE_ARRAY type.
Why are the changes needed?
Currently, the decimal columns in the generated Parquet files use the BINARY physical type, which Spark 3.5.1 cannot read.
Spark 3.5.1 can read them if the FIXED_LEN_BYTE_ARRAY physical type is used instead.
main:
optional binary fare_amount (DECIMAL(8,2));

This PR:
optional fixed_len_byte_array(5) fare_amount (DECIMAL(10,2));
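For context, here is a minimal sketch (not code from this PR; it assumes the parquet-mr Types builder, and the field/message names are illustrative) of declaring a decimal column with the FIXED_LEN_BYTE_ARRAY physical type, sizing the array from the precision:

```java
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class DecimalSchemaSketch {
  // Smallest byte length whose signed two's-complement range covers 10^precision.
  static int minBytesForPrecision(int precision) {
    int bytes = 1;
    while (Math.pow(2.0, 8 * bytes - 1) < Math.pow(10.0, precision)) {
      bytes++;
    }
    return bytes;
  }

  public static void main(String[] args) {
    int precision = 10;
    int scale = 2;
    // DECIMAL(10,2) needs 5 bytes, matching fixed_len_byte_array(5) above.
    MessageType schema = Types.buildMessage()
        .optional(PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY)
        .length(minBytesForPrecision(precision))
        .as(LogicalTypeAnnotation.decimalType(scale, precision))
        .named("fare_amount")
        .named("taxi");
    System.out.println(schema);
  }
}
```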
java -jar spark/target/orc-benchmarks-spark-2.1.0-SNAPSHOT.jar spark data -format=parquet -compress zstd -data taxi
org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException: column: [fare_amount], physicalType: BINARY, logicalType: decimal(8,2)
at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.constructConvertNotSupportedException(ParquetVectorUpdaterFactory.java:1136)
at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.getUpdater(ParquetVectorUpdaterFactory.java:199)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:175)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:342)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:233)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.orc.bench.spark.SparkBenchmark.processReader(SparkBenchmark.java:170)
at org.apache.orc.bench.spark.SparkBenchmark.fullRead(SparkBenchmark.java:216)
at org.apache.orc.bench.spark.jmh_generated.SparkBenchmark_fullRead_jmhTest.fullRead_avgt_jmhStub(SparkBenchmark_fullRead_jmhTest.java:219)
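For reference, the same error can be hit outside the benchmark harness with a plain Spark 3.5.1 read. This is only a sketch; the data path is illustrative, not the benchmark's actual layout:

```java
import org.apache.spark.sql.SparkSession;

public class ReadTaxiSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("read-taxi-decimal")
        .master("local[*]")
        .getOrCreate();

    // Fails with SchemaColumnConvertNotSupportedException when fare_amount is
    // stored as a BINARY-backed DECIMAL(8,2) in the generated file.
    spark.read()
        .parquet("data/generated/taxi/parquet.zstd")  // illustrative path
        .selectExpr("sum(fare_amount)")
        .show();

    spark.stop();
  }
}
```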
How was this patch tested?
local test
Was this patch authored or co-authored using generative AI tooling?
No
BINARY type promotion is supported in Spark 4.0.0, and the test using a 4.0.0 SNAPSHOT works in #1909.
[SPARK-40876][SQL] Widening type promotions in Parquet readers
Should we use INT32 and INT64 for decimals where applicable?
Yes, Spark does this by default. It also provides the option spark.sql.parquet.writeLegacyFormat=true to align with how Hive writes decimals.
https://github.com/apache/spark/blob/8b8ea60bd4f22ea5763a77bac2d51f25d2479be9/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L328-L339
writeLegacyParquetFormat match {
// Standard mode, 1 <= precision <= 9, writes as INT32
case false if precision <= Decimal.MAX_INT_DIGITS => int32Writer
// Standard mode, 10 <= precision <= 18, writes as INT64
case false if precision <= Decimal.MAX_LONG_DIGITS => int64Writer
// Legacy mode, 1 <= precision <= 18, writes as FIXED_LEN_BYTE_ARRAY
case true if precision <= Decimal.MAX_LONG_DIGITS => binaryWriterUsingUnscaledLong
// Either standard or legacy mode, 19 <= precision <= 38, writes as FIXED_LEN_BYTE_ARRAY
case _ => binaryWriterUsingUnscaledBytes
}
https://github.com/apache/hive/blob/4614ce72a7f366674d89a3a78f687e419400cb89/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java#L568-L578
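A hedged usage sketch (illustrative, not from this PR) of forcing the legacy layout from the Spark side, assuming the standard DataFrame writer API; the output path and query are placeholders:

```java
import org.apache.spark.sql.SparkSession;

public class LegacyDecimalWriteSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("legacy-decimal-write")
        .master("local[*]")
        // Write decimals with precision <= 18 as FIXED_LEN_BYTE_ARRAY instead of
        // INT32/INT64, matching Hive's layout.
        .config("spark.sql.parquet.writeLegacyFormat", "true")
        .getOrCreate();

    spark.sql("SELECT CAST(12.34 AS DECIMAL(10,2)) AS fare_amount")
        .write()
        .mode("overwrite")
        .parquet("/tmp/legacy_decimal");  // illustrative output path

    spark.stop();
  }
}
```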
Could you please check if an email sent from [email protected] was accidentally moved to the spam folder? @cxzl25
Wow, I did miss this email. Thank you @wgtmac so much for inviting me, thank you @dongjoon-hyun so much for commenting and merging multiple times, and thank you to the entire ORC community!
Sorry for the long delay. Thank you, @cxzl25 and @wgtmac . :)
Merged to main/2.0.