Support complex types in sparksql hash and xxhash64 functions
Currently, the sparksql hash functions only support primitive types. This patch adds implementations for complex types: array, map, and row.
The expected results in the unit tests are obtained from Spark's output.
Spark's implementation https://github.com/apache/spark/blob/a2b7050e0fc5db6ac98db57309e4737acd26bf3a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L536-L609
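For context on what hashing a complex type means here: Spark hashes arrays, maps, and rows by folding the running hash through every child, so the hash of one element becomes the seed for the next. Below is a minimal sketch of that recursion, assuming Spark's fold-the-seed semantics; the `Value` model and `hashPrimitive` are illustrative placeholders, not Velox or Spark APIs (the real patch operates on Velox vectors such as ArrayVector, MapVector, and RowVector):

```cpp
#include <cstdint>
#include <string>
#include <type_traits>
#include <utility>
#include <variant>
#include <vector>

// Illustrative value model; the real code works on Velox vectors.
struct Value;
using Array = std::vector<Value>;
using Row = std::vector<Value>;
using Map = std::vector<std::pair<Value, Value>>;

struct Value {
  std::variant<int64_t, std::string, Array, Map, Row> data;
};

// Placeholder for the per-type primitive hash; the real implementation
// uses Spark's Murmur3 (hash) or xxHash64 (xxhash64) with the given seed.
int64_t hashPrimitive(int64_t v, int64_t seed) {
  return seed * 31 + v; // placeholder mixing only
}

int64_t hashPrimitive(const std::string& s, int64_t seed) {
  int64_t h = seed;
  for (char c : s) {
    h = h * 31 + c; // placeholder mixing only
  }
  return h;
}

// Complex types fold the running hash through each child: the hash of
// one element becomes the seed for the next. Maps fold key then value
// for every entry.
int64_t hashOne(const Value& value, int64_t seed) {
  return std::visit(
      [&](const auto& v) -> int64_t {
        using T = std::decay_t<decltype(v)>;
        if constexpr (std::is_same_v<T, Array> || std::is_same_v<T, Row>) {
          int64_t result = seed;
          for (const auto& child : v) {
            result = hashOne(child, result);
          }
          return result;
        } else if constexpr (std::is_same_v<T, Map>) {
          int64_t result = seed;
          for (const auto& [key, val] : v) {
            result = hashOne(val, hashOne(key, result));
          }
          return result;
        } else {
          return hashPrimitive(v, seed);
        }
      },
      value.data);
}
```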
@mbasmanova Could you please help to review? Thanks!
@rui-mo @PHILO-HE Could you help to review again? Thanks!
@mbasmanova Could you help to review again? Thanks!
Is it possible to restructure the code to avoid so many static functions?
Please update the PR title to note that complex types are supported in the hash and xxhash64 functions.
> Is it possible to restructure the code to avoid so many static functions?

Curious, is there any downside to using static functions? The hash computations are stateless.
> > Is it possible to restructure the code to avoid so many static functions?
>
> Curious, is there any downside to using static functions? The hash computations are stateless.

Having all of the class's functions be static looks strange to me, but the current implementation is also fine by me.
@rui-mo @PHILO-HE @jinchengchenghh Do you have further comments? Thanks!
LGTM
@mbasmanova Could you help to review again? Thanks!
@mbasmanova Any more comments? The function is used by Gluten's columnar shuffle.
@mbasmanova Could you help to review this patch? Thanks!
@pedroerp Addressed comments. Could you help to review again? Thanks!
@pedroerp Sorry that I was on leave for the past few days. Just addressed the comments. Could you help to review again? Thanks!
@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@mbasmanova Could you help to review again? Thanks!
@mbasmanova @pedroerp I created a benchmark in https://github.com/facebookincubator/velox/pull/9414/commits/9265b975152586ebdda5576314133a3c24120f53
Benchmark Result at commit https://github.com/facebookincubator/velox/pull/9414/commits/569500df1ffb26642c44233af75941561cd10c13
============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp relative time/iter iters/s
============================================================================
hash_ARRAY<MAP<INTEGER,VARCHAR>>##hash 15.02ms 66.60
hash_ARRAY<MAP<INTEGER,VARCHAR>>##xxhash64 14.27ms 70.05
hash_ROW<f_map:MAP<INTEGER,VARCHAR>,f_array:ARR 25.94ms 38.55
hash_ROW<f_map:MAP<INTEGER,VARCHAR>,f_array:ARR 25.99ms 38.48
Benchmark Result at commit https://github.com/facebookincubator/velox/pull/9414/commits/9265b975152586ebdda5576314133a3c24120f53 (latest)
============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp relative time/iter iters/s
============================================================================
hash_ARRAY<MAP<INTEGER,VARCHAR>>##hash 17.76ms 56.31
hash_ARRAY<MAP<INTEGER,VARCHAR>>##xxhash64 16.20ms 61.71
hash_ROW<f_map:MAP<INTEGER,VARCHAR>,f_array:ARR 28.24ms 35.41
hash_ROW<f_map:MAP<INTEGER,VARCHAR>,f_array:ARR 28.32ms 35.31
Seems like the previous implementation at https://github.com/facebookincubator/velox/pull/9414/commits/569500df1ffb26642c44233af75941561cd10c13 is better.
> I created a benchmark

@marin-ma Thanks. Would you confirm that these results are from running the benchmark in 'release' mode?
> Seems like the previous implementation at 569500d is better.
I didn't go through your code, but if I understand correctly, this is a comparison between indirect branch prediction and direct branch prediction.
If the branch pattern is 100% random, the indirect one should be better; if it's 100% predictable, the direct one is better.
> I didn't go through your code, but if I understand correctly, this is a comparison between indirect branch prediction and direct branch prediction.
> If the branch pattern is 100% random, the indirect one should be better; if it's 100% predictable, the direct one is better.

@marin-ma Just to be sure, let's collect some PMU events to confirm.
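For readers following the dispatch discussion, the two strategies being compared look roughly like the sketch below; `VectorHasher`, `BigintHasher`, and `TypeKind` are hypothetical names, and the real code dispatches over Velox vectors. The virtual call is an indirect branch whose target is loaded from the vtable, while the switch compiles to direct branches whose pattern the BPU can learn:

```cpp
#include <cstdint>
#include <vector>

// Strategy 1: virtual dispatch. Each call loads its target from the
// vtable, so the CPU relies on the indirect branch predictor.
struct VectorHasher {
  virtual ~VectorHasher() = default;
  virtual uint64_t hashAt(int32_t row, uint64_t seed) const = 0;
};

struct BigintHasher : VectorHasher {
  std::vector<int64_t> values;
  uint64_t hashAt(int32_t row, uint64_t seed) const override {
    return seed * 31 + static_cast<uint64_t>(values[row]); // placeholder mix
  }
};

// Strategy 2: a switch over the type kind. These are direct branches;
// the BPU can learn them as long as the schema-driven pattern is fixed.
enum class TypeKind { BIGINT, VARCHAR, ARRAY };

uint64_t hashAt(TypeKind kind, const int64_t* values, int32_t row,
                uint64_t seed) {
  switch (kind) {
    case TypeKind::BIGINT:
      return seed * 31 + static_cast<uint64_t>(values[row]); // placeholder
    // case TypeKind::VARCHAR: ...
    // case TypeKind::ARRAY: ... (recurses into children)
    default:
      return seed;
  }
}
```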
> Would you confirm that these results are from running the benchmark in 'release' mode?

@mbasmanova Yes, in 'release' mode.
@mbasmanova, @marin-ma collected the data below. The conclusion is to use the virtual function call.
The virtual function call has 15% better performance. The root cause is that it executes 20% fewer instructions; as expected, the switch statement produces more branch instructions. In both cases branch misprediction isn't an issue: the miss ratio is at most 2.8%. The switch-based method doesn't show higher branch misprediction than the virtual call because the switch inside the function follows a fixed pattern derived from the schema; the pattern can be very complex, but the BPU handles it well as long as it stays fixed. The misprediction ratio may increase for a really complex schema that exceeds the BPU's history buffer, which currently holds on the order of 1,000 entries.
Raw PMU counters (events per second):

| Event (per second) | Virtual function call | Switch |
|---|---|---|
| INST_RETIRED.ANY | 7,850,078,349 | 8,586,837,809 |
| CPU_CLK_UNHALTED.REF_TSC | 2,600,825,704 | 2,600,865,762 |
| CPU_CLK_UNHALTED.THREAD | 3,094,305,687 | 3,094,298,604 |
| CYCLE_ACTIVITY.STALLS_L3_MISS | 694,925 | 915,935 |
| LONGEST_LAT_CACHE.MISS | 261,666 | 340,501 |
| MEM_LOAD_RETIRED.L3_MISS | 14,705 | 17,211 |
| BR_INST_RETIRED.ALL_BRANCHES | 1,087,985,060 | 1,400,282,806 |
| BR_MISP_RETIRED.ALL_BRANCHES | 30,420,663 | 27,806,660 |
| BR_INST_RETIRED.INDIRECT | 93,107,159 | 39,964,183 |
| BR_MISP_RETIRED.INDIRECT | 1,900 | 145,261 |

Derived metrics:

| Metric | Virtual function call | Switch | Virtual/switch |
|---|---|---|---|
| Instructions per loop | 14,714 | 18,584 | 0.792 |
| IPC | 2.54 | 2.78 | |
| Loops per second | 533,513 | 462,064 | 1.155 |
| Branch misprediction ratio | 2.8% | 2.0% | |
| Branch mispredictions per 1K instructions | 3.88 | 3.24 | |
| Branches per loop | 2,039 | 3,030 | |
@marin-ma Let's use the virtual function approach for now. Can you update the PR?
@mbasmanova Reverted to virtual function implementation. Looks like the UT failure is not caused by this change. Could you help to review again? Thanks!
One more metric, branch mispredictions per loop: virtual function call 57, switch 60.
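For reference, the derived metrics follow directly from the raw counters above; for example, for the virtual-function-call column (taking one "loop" as one benchmark iteration):

IPC = INST_RETIRED.ANY / CPU_CLK_UNHALTED.THREAD = 7,850,078,349 / 3,094,305,687 ≈ 2.54
Branch misprediction ratio = BR_MISP_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.ALL_BRANCHES = 30,420,663 / 1,087,985,060 ≈ 2.8%
Instructions per loop = INST_RETIRED.ANY / loops per second = 7,850,078,349 / 533,513 ≈ 14,714
Branch mispredictions per loop = 30,420,663 / 533,513 ≈ 57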
@mbasmanova @pedroerp Can you help to review again? This function is key for Gluten to support complex data types in shuffle.
@FelixYBW @marin-ma Folks, I used your benchmark to compare performance before and after this PR for primitive types. I'm seeing a significant regression.
Before:
============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp relative time/iter iters/s
============================================================================
hash_BIGINT##hash 585.43us 1.71K
hash_BIGINT##xxhash64 488.31us 2.05K
----------------------------------------------------------------------------
hash_INTEGER##hash 488.94us 2.05K
hash_INTEGER##xxhash64 471.50us 2.12K
hash_VARCHAR##hash 2.21ms 452.07
hash_VARCHAR##xxhash64 1.97ms 508.80
After:
============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp relative time/iter iters/s
============================================================================
hash_BIGINT##hash 883.61us 1.13K
hash_BIGINT##xxhash64 927.77us 1.08K
----------------------------------------------------------------------------
hash_INTEGER##hash 808.87us 1.24K
hash_INTEGER##xxhash64 815.35us 1.23K
hash_VARCHAR##hash 2.67ms 374.56
hash_VARCHAR##xxhash64 2.49ms 402.09
This is expected: the old solution inlines the hash function into the loop, which costs only about 18 cycles for a uint64_t input, while the new solution adds a virtual function call and possibly one more direct call. Any extra cycle in such a tight loop matters.
To optimize this, we could treat primitive types differently from complex types, at the cost of extra code complexity. @marin-ma Let's check whether it's possible to replace the virtual function call with a simpler indirect function call.
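A minimal sketch of what that simpler indirect call could look like, assuming the hash routine is resolved once per column into a plain function pointer; `HashFn`, `hashBigint`, and `hashColumn` are hypothetical names, not the actual Velox code:

```cpp
#include <cstdint>
#include <vector>

// One hash routine per primitive type, resolved once per column.
using HashFn = uint64_t (*)(const void* values, int32_t row, uint64_t seed);

uint64_t hashBigint(const void* values, int32_t row, uint64_t seed) {
  auto* data = static_cast<const int64_t*>(values);
  return seed * 31 + static_cast<uint64_t>(data[row]); // placeholder mix
}

// Hot loop: the call target is constant for the whole column, so the
// indirect branch is fully predictable, and there is no vtable load
// as there would be with a virtual call.
void hashColumn(HashFn fn, const void* values, int32_t numRows,
                std::vector<uint64_t>& seeds) {
  for (int32_t row = 0; row < numRows; ++row) {
    seeds[row] = fn(values, row, seeds[row]);
  }
}
```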
If the data size increases and cache misses grow, the gap will shrink because memory latency will hide the extra cycles.
Currently, hash calculation time is small compared to split, compression, and shuffle write in Gluten, so we may not need to put more effort into this optimization.
@marin-ma Did you test TPC-H in Gluten using this PR? If not, let's run a test and see what the performance loss is.