spark
spark copied to clipboard
[WIP][SPARK-47353][SQL][Prototype of alternative algorithm] Enable collation support for the Mode expression using multiple experimental approaches
Here is the PR description for the alternative PR:
PR Description
Introduction
This PR proposes an alternative approach to the original implementation using TreeMap
with a custom comparator for collation-sensitive grouping. The primary objective is to improve performance by leveraging OpenHashMap
with a custom hashing strategy.
Benchmark Results
The initial TreeMap
approach led to significant performance degradation, especially for unicode collations. After implementing a proof of concept using OpenHashMap
, the slowdown was reduced considerably.
Benchmark Results Overview
Approach | Slowdown (UTF8_BINARY Reference) |
---|---|
TreeMap |
16.9x - 56x |
OpenHashMap |
9.5x - 15x |
Details:
-
UTF8_BINARY (Baseline)
- Red-Black Tree / TreeMap Implementation: Slowdown ranges from 16.9x to 56x
-
OpenHashMap
Implementation: Slowdown ranges from 9.5x to 15x
- The numerical benchmark case has been ignored for this analysis.
Proposed Implementation
-
Custom Hasher: Introduced a custom
Hasher
to allow collation-sensitive grouping.- Modified
org.apache.spark.util.collection.OpenHashSet.Hasher.hash()
, specifically an override ofhash()
withinHasher[String with Collation]
. - This new
Hasher
branches to an alternative hashing method that is collation-sensitive.
- Modified
- Benchmark Enhancement: Benchmarks now include multiple collation types and evaluate different algorithms.
Approach 3 (Prototype)
In addition to TreeMap
and OpenHashMap
, a third approach has been introduced:
private def impl3(buff: OpenHashMap[AnyRef, Long]) = {
val modeMap = buff.toSeq.groupMapReduce {
case (key: String, _) =>
CollationFactory.getCollationKey(UTF8String.fromString(key), collationId)
case (key: UTF8String, _) =>
CollationFactory.getCollationKey(key, collationId)
case (key, _) => key
}(x => x)((x, y) => (x._1, x._2 + y._2)).values
modeMap
}
Next Steps
- Cardinality and Duplicates: Further analysis needed to determine expected cardinalities in realistic scenarios. For example, even in large datasets with 1 trillion rows, it's unlikely that all rows will have unique values.
- Benchmark Diversity: More varied benchmarks needed, especially in the experimental branch, to better evaluate different algorithms with varied duplicate values.
OpenJDK Benchmark Results
Configuration:
- Java Version: OpenJDK 64-Bit Server VM 21.0.2+13-LTS
- OS: Mac OS X 14.4.1
- CPU: Apple M3 Max
Collation Unit Benchmarks - Mode - 2000 Elements
Mode | Map Type | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
---|---|---|---|---|---|---|---|
Numerical Type | N/A | 0 | 0 | 0 | 30.5 | 32.8 | 9.7X |
UTF8_BINARY | N/A | 0 | 0 | 0 | 16.4 | 60.8 | 5.2X |
UTF8_BINARY_LCASE | treemap | 1 | 1 | 0 | 3.1 | 318.3 | 1.0X |
UNICODE | treemap | 2 | 2 | 0 | 1.2 | 829.7 | 0.4X |
UNICODE_CI | treemap | 2 | 2 | 0 | 1.1 | 877.5 | 0.4X |
UTF8_BINARY_LCASE | hashmap | 1 | 1 | 0 | 3.4 | 298.3 | 1.1X |
UNICODE | hashmap | 1 | 1 | 0 | 2.2 | 452.9 | 0.7X |
UNICODE_CI | hashmap | 1 | 1 | 0 | 2.1 | 467.3 | 0.7X |
UTF8_BINARY_LCASE | mapreduce | 0 | 0 | 0 | 5.3 | 187.6 | 1.7X |
UNICODE | mapreduce | 0 | 0 | 0 | 5.9 | 169.6 | 1.9X |
UNICODE_CI | mapreduce | 1 | 1 | 0 | 3.1 | 327.2 | 1.0X |
Collation Unit Benchmarks - Mode - 4000 Elements
Mode | Map Type | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
---|---|---|---|---|---|---|---|
UTF8_BINARY_LCASE | treemap | 2 | 2 | 0 | 2.4 | 409.3 | 1.0X |
UNICODE | treemap | 5 | 5 | 0 | 0.8 | 1190.5 | 0.3X |
UTF8_BINARY | N/A | 0 | 0 | 0 | 14.4 | 69.2 | 5.9X |
UNICODE_CI | treemap | 5 | 5 | 0 | 0.8 | 1213.4 | 0.3X |
Numerical Type | N/A | 0 | 0 | 0 | 24.5 | 40.8 | 10.0X |
UTF8_BINARY_LCASE | hashmap | 1 | 1 | 0 | 3.4 | 295.9 | 1.4X |
UNICODE | hashmap | 2 | 2 | 0 | 1.8 | 556.8 | 0.7X |
UNICODE_CI | hashmap | 2 | 2 | 0 | 2.2 | 459.8 | 0.9X |
UTF8_BINARY_LCASE | mapreduce | 1 | 1 | 0 | 4.5 | 220.3 | 1.9X |
UNICODE | mapreduce | 1 | 1 | 0 | 5.1 | 196.4 | 2.1X |
UNICODE_CI | mapreduce | 1 | 2 | 0 | 2.7 | 364.8 | 1.1X |
Mode | Map Type | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
---|---|---|---|---|---|---|---|
UTF8_BINARY_LCASE | treemap | 4 | 4 | 0 | 2.1 | 473.8 | 1.0X |
UNICODE | treemap | 11 | 12 | 1 | 0.7 | 1385.7 | 0.3X |
UTF8_BINARY | treemap | 1 | 1 | 0 | 13.9 | 71.7 | 6.6X |
UNICODE_CI | treemap | 11 | 11 | 0 | 0.7 | 1348.2 | 0.4X |
Numerical Type | treemap | 0 | 0 | 0 | 22.7 | 44.0 | 10.8X |
UTF8_BINARY_LCASE | hashmap | 3 | 3 | 0 | 2.9 | 339.9 | 1.4X |
UNICODE | hashmap | 5 | 5 | 0 | 1.8 | 568.0 | 0.8X |
UTF8_BINARY | hashmap | 1 | 1 | 0 | 13.5 | 74.3 | 6.4X |
UNICODE_CI | hashmap | 4 | 4 | 0 | 2.1 | 470.2 | 1.0X |
Numerical Type | hashmap | 0 | 0 | 0 | 23.2 | 43.2 | 11.0X |
UTF8_BINARY_LCASE | mapreduce | 2 | 2 | 0 | 3.9 | 255.7 | 1.9X |
UNICODE | mapreduce | 2 | 2 | 0 | 4.3 | 230.2 | 2.1X |
UTF8_BINARY | mapreduce | 1 | 1 | 0 | 14.8 | 67.5 | 7.0X |
UNICODE_CI | mapreduce | 3 | 3 | 0 | 2.5 | 396.6 | 1.2X |
Numerical Type | mapreduce | 0 | 0 | 0 | 23.7 | 42.2 | 11.2X |
Thought: These benchmarks aren't on populations where the collation makes any difference.
OpenJDK Benchmark Results
Environment
- JVM: OpenJDK 64-Bit Server VM 21.0.2+13-LTS
- OS: Mac OS X 14.4.1
- Hardware: Apple M3 Max
Collation Unit Benchmarks (6000 Elements)
Mode | Implementation | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
---|---|---|---|---|---|---|---|
UTF8_BINARY_LCASE | treemap | 3 | 3 | 0 | 2.2 | 456.0 | 1.0X |
UNICODE | treemap | 8 | 8 | 0 | 0.8 | 1274.2 | 0.4X |
UTF8_BINARY | treemap | 0 | 1 | 1 | 14.1 | 71.0 | 6.4X |
UNICODE_CI | treemap | 8 | 8 | 0 | 0.8 | 1273.0 | 0.4X |
Numerical Type | N/A | 0 | 0 | 0 | 23.8 | 42.0 | 10.9X |
UTF8_BINARY_LCASE | hashmap | 2 | 3 | 2 | 2.7 | 368.3 | 1.2X |
UNICODE_CI | hashmap | 3 | 4 | 0 | 1.8 | 550.0 | 0.8X |
UTF8_BINARY_LCASE | mapreduce | 1 | 2 | 0 | 4.1 | 242.4 | 1.9X |
UNICODE_CI | mapreduce | 2 | 3 | 0 | 2.6 | 390.8 | 1.2X |
UTF8_BINARY_LCASE | treemap 0.05% | 3 | 3 | 0 | 1.9 | 516.8 | 0.9X |
UTF8_BINARY | treemap 0.05% | 0 | 1 | 0 | 13.2 | 76.0 | 6.0X |
UNICODE_CI | treemap 0.05% | 8 | 9 | 2 | 0.8 | 1322.2 | 0.3X |
UTF8_BINARY_LCASE | hashmap 0.05% | 2 | 2 | 0 | 2.8 | 362.8 | 1.3X |
UNICODE_CI | hashmap 0.05% | 3 | 4 | 0 | 1.9 | 530.0 | 0.9X |
UTF8_BINARY_LCASE | mapreduce 0.05% | 2 | 2 | 0 | 3.8 | 263.1 | 1.7X |
UNICODE_CI | mapreduce 0.05% | 2 | 3 | 0 | 2.7 | 375.7 | 1.2X |
UTF8_BINARY_LCASE | treemap 0.1% | 3 | 3 | 0 | 1.9 | 521.4 | 0.9X |
UTF8_BINARY | treemap 0.1% | 0 | 1 | 0 | 12.6 | 79.4 | 5.7X |
UNICODE_CI | treemap 0.1% | 8 | 9 | 1 | 0.7 | 1346.7 | 0.3X |
UTF8_BINARY_LCASE | hashmap 0.1% | 2 | 2 | 0 | 2.8 | 356.8 | 1.3X |
UNICODE_CI | hashmap 0.1% | 3 | 4 | 0 | 1.9 | 516.3 | 0.9X |
UTF8_BINARY_LCASE | mapreduce 0.1% | 2 | 2 | 0 | 3.7 | 269.0 | 1.7X |
UNICODE_CI | mapreduce 0.1% | 2 | 3 | 0 | 2.6 | 388.0 | 1.2X |
Collation Unit Benchmarks (12000 Elements)
Mode | Implementation | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
---|---|---|---|---|---|---|---|
UTF8_BINARY_LCASE | treemap | 7 | 7 | 0 | 1.8 | 557.2 | 1.0X |
UNICODE | treemap | 18 | 20 | 1 | 0.7 | 1492.5 | 0.4X |
UTF8_BINARY | treemap | 1 | 1 | 0 | 11.9 | 84.3 | 6.6X |
UNICODE_CI | treemap | 17 | 20 | 1 | 0.7 | 1433.1 | 0.4X |
Numerical Type | N/A | 1 | 1 | 0 | 21.7 | 46.1 | 12.1X |
UTF8_BINARY_LCASE | hashmap | 5 | 5 | 0 | 2.6 | 379.8 | 1.5X |
UNICODE_CI | hashmap | 7 | 7 | 1 | 1.8 | 571.0 | 1.0X |
UTF8_BINARY_LCASE | mapreduce | 3 | 4 | 0 | 3.6 | 279.3 | 2.0X |
UNICODE_CI | mapreduce | 5 | 5 | 0 | 2.4 | 411.4 | 1.4X |
UTF8_BINARY_LCASE | treemap 0.05% | 7 | 7 | 0 | 1.7 | 585.1 | 1.0X |
UTF8_BINARY | treemap 0.05% | 1 | 1 | 0 | 10.8 | 92.5 | 6.0X |
UNICODE_CI | treemap 0.05% | 17 | 19 | 1 | 0.7 | 1454.2 | 0.4X |
UTF8_BINARY_LCASE | hashmap 0.05% | 5 | 5 | 0 | 2.5 | 400.6 | 1.4X |
UNICODE_CI | hashmap 0.05% | 7 | 8 | 1 | 1.7 | 593.0 | 0.9X |
UTF8_BINARY_LCASE | mapreduce 0.05% | 4 | 4 | 0 | 3.3 | 304.8 | 1.8X |
UNICODE_CI | mapreduce 0.05% | 5 | 6 | 1 | 2.3 | 433.9 | 1.3X |
UTF8_BINARY_LCASE | treemap 0.1% | 7 | 8 | 0 | 1.7 | 591.2 | 0.9X |
UTF8_BINARY | treemap 0.1% | 1 | 1 | 0 | 11.1 | 90.5 | 6.2X |
UNICODE_CI | treemap 0.1% | 18 | 19 | 1 | 0.7 | 1482.1 | 0.4X |
UTF8_BINARY_LCASE | hashmap 0.1% | 4 | 5 | 0 | 2.7 | 373.5 | 1.5X |
UNICODE_CI | hashmap 0.1% | 7 | 7 | 0 | 1.7 | 573.7 | 1.0X |
UTF8_BINARY_LCASE | mapreduce 0.1% | 4 | 4 | 0 | 3.3 | 300.8 | 1.9X |
UNICODE_CI | mapreduce 0.1% | 5 | 6 | 0 | 2.3 | 437.3 | 1.3X |