[SPARK-49155][SQL][SS] Use more appropriate parameter type to construct `GenericArrayData`

Open LuciferYang opened this issue 1 year ago • 0 comments

What changes were proposed in this pull request?

According to the results of GenericArrayDataBenchmark, constructing GenericArrayData from an Array[Any] is far more efficient than the other construction paths:

https://github.com/apache/spark/blob/master/sql/catalyst/benchmarks/GenericArrayDataBenchmark-results.txt

OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
AMD EPYC 7763 64-Core Processor
constructor:                              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
arrayOfAny                                            6              6           0       1620.1           0.6       1.0X
arrayOfAnyAsObject                                    6              6           0       1620.1           0.6       1.0X
arrayOfAnyAsSeq                                     155            155           1         64.7          15.5       0.0X
arrayOfInt                                          253            254           1         39.6          25.3       0.0X
arrayOfIntAsObject                                  252            253           1         39.7          25.2       0.0X

This PR therefore optimizes several places in the Spark code where GenericArrayData is constructed:

  1. In ArraysZip#eval and XPathList#nullSafeEval, arrays that were declared with a specific element type are now declared as Array[AnyRef], which avoids an additional collection copy when constructing GenericArrayData. This works because an Array[AnyRef] also matches the `case array: Array[Any] => array` branch in the following code:

https://github.com/apache/spark/blob/af70aafd330fdbb6ce0d5b3efbcb180cda488695/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L42-L48
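The branch selection described in item 1 can be illustrated with a small standalone sketch. The `normalize` helper below is hypothetical and only mimics the shape of such a match; it is not the Spark source at the link above. It assumes nothing beyond the Scala standard library.

```scala
object GenericArrayDataDispatchSketch {
  // Hypothetical stand-in for a constructor that accepts "a Seq or an array".
  // Inputs of any other type would raise a MatchError in this sketch.
  def normalize(seqOrArray: Any): Array[Any] = seqOrArray match {
    case seq: scala.collection.Seq[_] => seq.toArray[Any]       // copies the elements
    case array: Array[Any]            => array                  // reused as-is, no copy
    case array: Array[_]              => array.toSeq.toArray[Any] // primitive array: box and copy
  }

  def main(args: Array[String]): Unit = {
    val refs: Array[AnyRef] = Array[AnyRef]("a", "b")
    val ints: Array[Int] = Array(1, 2, 3)

    // Array[AnyRef] and Array[Any] are both Object[] on the JVM,
    // so the Array[Any] branch hands back the very same instance.
    println(normalize(refs) eq refs)

    // Array[Int] is an int[] at runtime, so it falls through to the
    // last branch and a new boxed array is allocated.
    println(normalize(ints) eq ints)
  }
}
```

The reference-equality checks make the difference visible: only the object-array input is passed through without a copy, which is consistent with the gap between arrayOfAny and arrayOfInt in the benchmark above.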

  2. In HistogramNumeric#eval, an IndexedSeq[InternalRow] was originally used to construct GenericArrayData. Since the length of the collection is known in advance, it is refactored to build an Array[AnyRef] and construct GenericArrayData from that.

  3. In the remaining cases, the argument passed when constructing GenericArrayData is currently ${input}.toArray. It is changed to ${input}.toArray[Any] to avoid another collection copy during the construction of GenericArrayData.
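As a rough illustration of the last two items, the sketch below shows a preallocated Array[AnyRef] filled in a loop, and the difference between `toArray` and `toArray[Any]` on a Seq. The object name, the `fillKnownLength` helper, and the use of String elements in place of InternalRow are assumptions made for the example; none of this is taken from the patch itself.

```scala
object ArrayConstructionSketch {
  // Known length up front: allocate the object array once and fill it,
  // instead of building an intermediate IndexedSeq first.
  // String stands in for InternalRow here.
  def fillKnownLength(values: IndexedSeq[String]): Array[AnyRef] = {
    val result = new Array[AnyRef](values.length)
    var i = 0
    while (i < values.length) {
      result(i) = values(i)
      i += 1
    }
    result
  }

  def main(args: Array[String]): Unit = {
    val input: Seq[String] = Seq("a", "b", "c")

    val typed = input.toArray       // static type Array[String]
    val erased = input.toArray[Any] // static type Array[Any]

    // The runtime classes differ: String[] versus Object[].
    println(typed.getClass.getSimpleName)
    println(erased.getClass.getSimpleName)

    println(fillKnownLength(Vector("x", "y")).length)
  }
}
```

Asking for `toArray[Any]` yields an array whose static type is already Array[Any], so no later conversion step is needed to obtain one.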

Why are the changes needed?

Constructing GenericArrayData from an Array[Any] or Array[AnyRef] can improve performance by reducing collection copying.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass GitHub Actions

Was this patch authored or co-authored using generative AI tooling?

No

LuciferYang, Aug 08 '24 05:08