orc icon indicating copy to clipboard operation
orc copied to clipboard

ORC-1700: Write parquet decimal type data in Benchmark using `FIXED_LEN_BYTE_ARRAY` type

Open cxzl25 opened this issue 1 year ago • 5 comments

What changes were proposed in this pull request?

This PR aims to write parquet decimal type data in Benchmark using FIXED_LEN_BYTE_ARRAY type.

Why are the changes needed?

Because the decimal type of the parquet file generated now corresponds to the binary type of parquet, but Spark3.5.1 does not support reading. Spark 3.5.1 supports reading if using the FIXED_LEN_BYTE_ARRAY type.

main

  optional binary fare_amount (DECIMAL(8,2));

PR

  optional fixed_len_byte_array(5) fare_amount (DECIMAL(10,2));
java -jar spark/target/orc-benchmarks-spark-2.1.0-SNAPSHOT.jar spark data -format=parquet  -compress zstd -data taxi
org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException: column: [fare_amount], physicalType: BINARY, logicalType: decimal(8,2)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.constructConvertNotSupportedException(ParquetVectorUpdaterFactory.java:1136)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.getUpdater(ParquetVectorUpdaterFactory.java:199)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:175)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:342)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:233)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at org.apache.orc.bench.spark.SparkBenchmark.processReader(SparkBenchmark.java:170)
	at org.apache.orc.bench.spark.SparkBenchmark.fullRead(SparkBenchmark.java:216)
	at org.apache.orc.bench.spark.jmh_generated.SparkBenchmark_fullRead_jmhTest.fullRead_avgt_jmhStub(SparkBenchmark_fullRead_jmhTest.java:219)

How was this patch tested?

local test

Was this patch authored or co-authored using generative AI tooling?

No

cxzl25 avatar Apr 24 '24 11:04 cxzl25

BINARY type promotion can be supported in Spark 4.0.0, and the test using 4.0.0 SNAPSHOT can work in #1909

[SPARK-40876][SQL] Widening type promotions in Parquet readers

cxzl25 avatar Apr 24 '24 12:04 cxzl25

Should we use INT32 and INT64 for decimals where applicable?

wgtmac avatar Apr 27 '24 03:04 wgtmac

Should we use INT32 and INT64 for decimals where applicable?

Yes, Spark does this by default. It provides an option spark.sql.parquet.writeLegacyFormat=true to achieve alignment with Hive writing decimal method.

https://github.com/apache/spark/blob/8b8ea60bd4f22ea5763a77bac2d51f25d2479be9/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L328-L339

    writeLegacyParquetFormat match {
      // Standard mode, 1 <= precision <= 9, writes as INT32
      case false if precision <= Decimal.MAX_INT_DIGITS => int32Writer


      // Standard mode, 10 <= precision <= 18, writes as INT64
      case false if precision <= Decimal.MAX_LONG_DIGITS => int64Writer


      // Legacy mode, 1 <= precision <= 18, writes as FIXED_LEN_BYTE_ARRAY
      case true if precision <= Decimal.MAX_LONG_DIGITS => binaryWriterUsingUnscaledLong


      // Either standard or legacy mode, 19 <= precision <= 38, writes as FIXED_LEN_BYTE_ARRAY
      case _ => binaryWriterUsingUnscaledBytes

https://github.com/apache/hive/blob/4614ce72a7f366674d89a3a78f687e419400cb89/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java#L568-L578

cxzl25 avatar Apr 28 '24 06:04 cxzl25

Could you please check if an email sent from [email protected] is accidently moved to the spam folder? @cxzl25

wgtmac avatar May 14 '24 08:05 wgtmac

Could you please check if an email sent from [email protected] is accidently moved to the spam folder

Wow, I did miss this email, thank you @wgtmac so much for inviting me, and thank you @dongjoon-hyun so much to for commenting and merging multiple times, and to the entire ORC community!

cxzl25 avatar May 14 '24 08:05 cxzl25

Sorry for the long delay. Thank you, @cxzl25 and @wgtmac . :)

dongjoon-hyun avatar Jul 09 '24 19:07 dongjoon-hyun

Merged to main/2.0.

dongjoon-hyun avatar Jul 09 '24 19:07 dongjoon-hyun