spark icon indicating copy to clipboard operation
spark copied to clipboard

[WIP][SPARK-47353][SQL][Prototype of alternative algorithm] Enable collation support for the Mode expression using multiple experimental approaches

Open GideonPotok opened this issue 9 months ago • 0 comments

Here is the PR description for the alternative PR:


PR Description

Introduction

This PR proposes an alternative approach to the original implementation using TreeMap with a custom comparator for collation-sensitive grouping. The primary objective is to improve performance by leveraging OpenHashMap with a custom hashing strategy.

Benchmark Results

The initial TreeMap approach led to significant performance degradation, especially for unicode collations. After implementing a proof of concept using OpenHashMap, the slowdown was reduced considerably.

Benchmark Results Overview

Approach Slowdown (UTF8_BINARY Reference)
TreeMap 16.9x - 56x
OpenHashMap 9.5x - 15x

Details:

  • UTF8_BINARY (Baseline)
    • Red-Black Tree / TreeMap Implementation: Slowdown ranges from 16.9x to 56x
    • OpenHashMap Implementation: Slowdown ranges from 9.5x to 15x
  • The numerical benchmark case has been ignored for this analysis.

Proposed Implementation

  • Custom Hasher: Introduced a custom Hasher to allow collation-sensitive grouping.
    • Modified org.apache.spark.util.collection.OpenHashSet.Hasher.hash(), specifically an override of hash() within Hasher[String with Collation].
    • This new Hasher branches to an alternative hashing method that is collation-sensitive.
  • Benchmark Enhancement: Benchmarks now include multiple collation types and evaluate different algorithms.

Approach 3 (Prototype) In addition to TreeMap and OpenHashMap, a third approach has been introduced:

private def impl3(buff: OpenHashMap[AnyRef, Long]) = {
  val modeMap = buff.toSeq.groupMapReduce {
    case (key: String, _) =>
      CollationFactory.getCollationKey(UTF8String.fromString(key), collationId)
    case (key: UTF8String, _) =>
      CollationFactory.getCollationKey(key, collationId)
    case (key, _) => key
  }(x => x)((x, y) => (x._1, x._2 + y._2)).values
  modeMap
}

Next Steps

  • Cardinality and Duplicates: Further analysis needed to determine expected cardinalities in realistic scenarios. For example, even in large datasets with 1 trillion rows, it's unlikely that all rows will have unique values.
  • Benchmark Diversity: More varied benchmarks needed, especially in the experimental branch, to better evaluate different algorithms with varied duplicate values.

OpenJDK Benchmark Results

Configuration:

  • Java Version: OpenJDK 64-Bit Server VM 21.0.2+13-LTS
  • OS: Mac OS X 14.4.1
  • CPU: Apple M3 Max

Collation Unit Benchmarks - Mode - 2000 Elements

Mode Map Type Best Time (ms) Avg Time (ms) Stdev (ms) Rate (M/s) Per Row (ns) Relative
Numerical Type N/A 0 0 0 30.5 32.8 9.7X
UTF8_BINARY N/A 0 0 0 16.4 60.8 5.2X
UTF8_BINARY_LCASE treemap 1 1 0 3.1 318.3 1.0X
UNICODE treemap 2 2 0 1.2 829.7 0.4X
UNICODE_CI treemap 2 2 0 1.1 877.5 0.4X
UTF8_BINARY_LCASE hashmap 1 1 0 3.4 298.3 1.1X
UNICODE hashmap 1 1 0 2.2 452.9 0.7X
UNICODE_CI hashmap 1 1 0 2.1 467.3 0.7X
UTF8_BINARY_LCASE mapreduce 0 0 0 5.3 187.6 1.7X
UNICODE mapreduce 0 0 0 5.9 169.6 1.9X
UNICODE_CI mapreduce 1 1 0 3.1 327.2 1.0X

Collation Unit Benchmarks - Mode - 4000 Elements

Mode Map Type Best Time (ms) Avg Time (ms) Stdev (ms) Rate (M/s) Per Row (ns) Relative
UTF8_BINARY_LCASE treemap 2 2 0 2.4 409.3 1.0X
UNICODE treemap 5 5 0 0.8 1190.5 0.3X
UTF8_BINARY N/A 0 0 0 14.4 69.2 5.9X
UNICODE_CI treemap 5 5 0 0.8 1213.4 0.3X
Numerical Type N/A 0 0 0 24.5 40.8 10.0X
UTF8_BINARY_LCASE hashmap 1 1 0 3.4 295.9 1.4X
UNICODE hashmap 2 2 0 1.8 556.8 0.7X
UNICODE_CI hashmap 2 2 0 2.2 459.8 0.9X
UTF8_BINARY_LCASE mapreduce 1 1 0 4.5 220.3 1.9X
UNICODE mapreduce 1 1 0 5.1 196.4 2.1X
UNICODE_CI mapreduce 1 2 0 2.7 364.8 1.1X
Mode Map Type Best Time (ms) Avg Time (ms) Stdev (ms) Rate (M/s) Per Row (ns) Relative
UTF8_BINARY_LCASE treemap 4 4 0 2.1 473.8 1.0X
UNICODE treemap 11 12 1 0.7 1385.7 0.3X
UTF8_BINARY treemap 1 1 0 13.9 71.7 6.6X
UNICODE_CI treemap 11 11 0 0.7 1348.2 0.4X
Numerical Type treemap 0 0 0 22.7 44.0 10.8X
UTF8_BINARY_LCASE hashmap 3 3 0 2.9 339.9 1.4X
UNICODE hashmap 5 5 0 1.8 568.0 0.8X
UTF8_BINARY hashmap 1 1 0 13.5 74.3 6.4X
UNICODE_CI hashmap 4 4 0 2.1 470.2 1.0X
Numerical Type hashmap 0 0 0 23.2 43.2 11.0X
UTF8_BINARY_LCASE mapreduce 2 2 0 3.9 255.7 1.9X
UNICODE mapreduce 2 2 0 4.3 230.2 2.1X
UTF8_BINARY mapreduce 1 1 0 14.8 67.5 7.0X
UNICODE_CI mapreduce 3 3 0 2.5 396.6 1.2X
Numerical Type mapreduce 0 0 0 23.7 42.2 11.2X

Thought: These benchmarks aren't on populations where the collation makes any difference.

OpenJDK Benchmark Results

Environment

  • JVM: OpenJDK 64-Bit Server VM 21.0.2+13-LTS
  • OS: Mac OS X 14.4.1
  • Hardware: Apple M3 Max

Collation Unit Benchmarks (6000 Elements)

Mode Implementation Best Time (ms) Avg Time (ms) Stdev (ms) Rate (M/s) Per Row (ns) Relative
UTF8_BINARY_LCASE treemap 3 3 0 2.2 456.0 1.0X
UNICODE treemap 8 8 0 0.8 1274.2 0.4X
UTF8_BINARY treemap 0 1 1 14.1 71.0 6.4X
UNICODE_CI treemap 8 8 0 0.8 1273.0 0.4X
Numerical Type N/A 0 0 0 23.8 42.0 10.9X
UTF8_BINARY_LCASE hashmap 2 3 2 2.7 368.3 1.2X
UNICODE_CI hashmap 3 4 0 1.8 550.0 0.8X
UTF8_BINARY_LCASE mapreduce 1 2 0 4.1 242.4 1.9X
UNICODE_CI mapreduce 2 3 0 2.6 390.8 1.2X
UTF8_BINARY_LCASE treemap 0.05% 3 3 0 1.9 516.8 0.9X
UTF8_BINARY treemap 0.05% 0 1 0 13.2 76.0 6.0X
UNICODE_CI treemap 0.05% 8 9 2 0.8 1322.2 0.3X
UTF8_BINARY_LCASE hashmap 0.05% 2 2 0 2.8 362.8 1.3X
UNICODE_CI hashmap 0.05% 3 4 0 1.9 530.0 0.9X
UTF8_BINARY_LCASE mapreduce 0.05% 2 2 0 3.8 263.1 1.7X
UNICODE_CI mapreduce 0.05% 2 3 0 2.7 375.7 1.2X
UTF8_BINARY_LCASE treemap 0.1% 3 3 0 1.9 521.4 0.9X
UTF8_BINARY treemap 0.1% 0 1 0 12.6 79.4 5.7X
UNICODE_CI treemap 0.1% 8 9 1 0.7 1346.7 0.3X
UTF8_BINARY_LCASE hashmap 0.1% 2 2 0 2.8 356.8 1.3X
UNICODE_CI hashmap 0.1% 3 4 0 1.9 516.3 0.9X
UTF8_BINARY_LCASE mapreduce 0.1% 2 2 0 3.7 269.0 1.7X
UNICODE_CI mapreduce 0.1% 2 3 0 2.6 388.0 1.2X

Collation Unit Benchmarks (12000 Elements)

Mode Implementation Best Time (ms) Avg Time (ms) Stdev (ms) Rate (M/s) Per Row (ns) Relative
UTF8_BINARY_LCASE treemap 7 7 0 1.8 557.2 1.0X
UNICODE treemap 18 20 1 0.7 1492.5 0.4X
UTF8_BINARY treemap 1 1 0 11.9 84.3 6.6X
UNICODE_CI treemap 17 20 1 0.7 1433.1 0.4X
Numerical Type N/A 1 1 0 21.7 46.1 12.1X
UTF8_BINARY_LCASE hashmap 5 5 0 2.6 379.8 1.5X
UNICODE_CI hashmap 7 7 1 1.8 571.0 1.0X
UTF8_BINARY_LCASE mapreduce 3 4 0 3.6 279.3 2.0X
UNICODE_CI mapreduce 5 5 0 2.4 411.4 1.4X
UTF8_BINARY_LCASE treemap 0.05% 7 7 0 1.7 585.1 1.0X
UTF8_BINARY treemap 0.05% 1 1 0 10.8 92.5 6.0X
UNICODE_CI treemap 0.05% 17 19 1 0.7 1454.2 0.4X
UTF8_BINARY_LCASE hashmap 0.05% 5 5 0 2.5 400.6 1.4X
UNICODE_CI hashmap 0.05% 7 8 1 1.7 593.0 0.9X
UTF8_BINARY_LCASE mapreduce 0.05% 4 4 0 3.3 304.8 1.8X
UNICODE_CI mapreduce 0.05% 5 6 1 2.3 433.9 1.3X
UTF8_BINARY_LCASE treemap 0.1% 7 8 0 1.7 591.2 0.9X
UTF8_BINARY treemap 0.1% 1 1 0 11.1 90.5 6.2X
UNICODE_CI treemap 0.1% 18 19 1 0.7 1482.1 0.4X
UTF8_BINARY_LCASE hashmap 0.1% 4 5 0 2.7 373.5 1.5X
UNICODE_CI hashmap 0.1% 7 7 0 1.7 573.7 1.0X
UTF8_BINARY_LCASE mapreduce 0.1% 4 4 0 3.3 300.8 1.9X
UNICODE_CI mapreduce 0.1% 5 6 0 2.3 437.3 1.3X

GideonPotok avatar May 08 '24 21:05 GideonPotok