
Support complex types in sparksql hash and xxhash64 function

Open marin-ma opened this issue 10 months ago • 26 comments

Currently, the sparksql hash functions only support primitive types. This patch adds implementations for complex types, including array, map, and row.

The expected results in the unit tests are obtained from Spark's output.

Spark's implementation https://github.com/apache/spark/blob/a2b7050e0fc5db6ac98db57309e4737acd26bf3a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L536-L609
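Spark's scheme for complex types chains seeds: the hash of one element becomes the seed for the next, so arrays, maps, and rows reduce to recursive calls over their children. A minimal sketch of that recursion is below; this is not Velox or Spark code, all names (`Value`, `hashValue`, `hashInt64`) are hypothetical, and the integer mixer is a toy stand-in, not real Murmur3 or xxHash64.

```cpp
#include <cstdint>
#include <variant>
#include <vector>

struct Value; // forward declaration so the variant can hold nested arrays
using Array = std::vector<Value>;

// A toy recursive value: either a primitive or an array of values.
struct Value {
  std::variant<int64_t, Array> v;
};

// Placeholder mixer standing in for a real Murmur3/xxHash64 step.
uint64_t hashInt64(int64_t x, uint64_t seed) {
  uint64_t h = seed ^ (static_cast<uint64_t>(x) * 0x9E3779B97F4A7C15ULL);
  h ^= h >> 32;
  return h;
}

// Seed chaining: each element's hash seeds the next, so element order
// matters and nesting is handled by plain recursion.
uint64_t hashValue(const Value& val, uint64_t seed) {
  if (const auto* i = std::get_if<int64_t>(&val.v)) {
    return hashInt64(*i, seed);
  }
  uint64_t h = seed;
  for (const auto& elem : std::get<Array>(val.v)) {
    h = hashValue(elem, h);
  }
  return h;
}
```

Because the seed is threaded through the loop, hashing `[1, 2]` and `[2, 1]` produces different results, matching the order sensitivity of Spark's implementation.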

marin-ma avatar Apr 09 '24 07:04 marin-ma


@mbasmanova Could you please help to review? Thanks!

marin-ma avatar Apr 09 '24 11:04 marin-ma

@rui-mo @PHILO-HE Could you help to review again? Thanks!

marin-ma avatar Apr 11 '24 03:04 marin-ma

@mbasmanova Could you help to review again? Thanks!

marin-ma avatar Apr 11 '24 08:04 marin-ma

Is it possible to restructure the code to avoid so many static functions?

jinchengchenghh avatar Apr 15 '24 01:04 jinchengchenghh

Please update the PR title to note that it supports complex types in the hash and xxhash64 functions.

jinchengchenghh avatar Apr 15 '24 01:04 jinchengchenghh

Is it possible to restructure the code to avoid so many static functions?

Curious, is there any downside to using static functions? The hash computations are stateless.

marin-ma avatar Apr 15 '24 01:04 marin-ma

Is it possible to restructure the code to avoid so many static functions?

Curious, is there any downside to using static functions? The hash computations are stateless.

A class where all of the functions are static looks strange to me, but the current implementation is also OK with me.

jinchengchenghh avatar Apr 15 '24 01:04 jinchengchenghh

@rui-mo @PHILO-HE @jinchengchenghh Do you have further comments? Thanks!

marin-ma avatar Apr 15 '24 03:04 marin-ma

LGTM

jinchengchenghh avatar Apr 15 '24 05:04 jinchengchenghh

@mbasmanova Could you help to review again? Thanks!

marin-ma avatar Apr 15 '24 09:04 marin-ma

@mbasmanova Any more comments? The function is used by Gluten columnar shuffle.

FelixYBW avatar Apr 19 '24 06:04 FelixYBW

@mbasmanova Could you help to review this patch? Thanks!

marin-ma avatar Apr 24 '24 08:04 marin-ma

@pedroerp Addressed comments. Could you help to review again? Thanks!

marin-ma avatar Apr 26 '24 07:04 marin-ma

@pedroerp Sorry that I was on leave for the past few days. Just addressed the comments. Could you help to review again? Thanks!

marin-ma avatar Apr 30 '24 01:04 marin-ma

@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Apr 30 '24 19:04 facebook-github-bot

@mbasmanova Could you help to review again? Thanks!

marin-ma avatar May 01 '24 03:05 marin-ma

@mbasmanova @pedroerp I created a benchmark in https://github.com/facebookincubator/velox/pull/9414/commits/9265b975152586ebdda5576314133a3c24120f53

Benchmark Result at commit https://github.com/facebookincubator/velox/pull/9414/commits/569500df1ffb26642c44233af75941561cd10c13

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
hash_ARRAY<MAP<INTEGER,VARCHAR>>##hash                     15.02ms     66.60
hash_ARRAY<MAP<INTEGER,VARCHAR>>##xxhash64                 14.27ms     70.05
hash_ROW<f_map:MAP<INTEGER,VARCHAR>,f_array:ARR            25.94ms     38.55
hash_ROW<f_map:MAP<INTEGER,VARCHAR>,f_array:ARR            25.99ms     38.48

Benchmark Result at commit https://github.com/facebookincubator/velox/pull/9414/commits/9265b975152586ebdda5576314133a3c24120f53 (latest)

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
hash_ARRAY<MAP<INTEGER,VARCHAR>>##hash                     17.76ms     56.31
hash_ARRAY<MAP<INTEGER,VARCHAR>>##xxhash64                 16.20ms     61.71
hash_ROW<f_map:MAP<INTEGER,VARCHAR>,f_array:ARR            28.24ms     35.41
hash_ROW<f_map:MAP<INTEGER,VARCHAR>,f_array:ARR            28.32ms     35.31

Seems like the previous implementation at https://github.com/facebookincubator/velox/pull/9414/commits/569500df1ffb26642c44233af75941561cd10c13 is better.

marin-ma avatar May 09 '24 12:05 marin-ma

I created a benchmark

@marin-ma Thanks. Would you confirm that these results are from running the benchmark in 'release' mode?

mbasmanova avatar May 09 '24 13:05 mbasmanova

@mbasmanova @pedroerp I created a benchmark in 9265b97 [...] Seems like the previous implementation at 569500d is better.

I didn't go through your code, but if I understand correctly, it's a comparison between indirect branch prediction and direct branch prediction.

If the branch target is 100% random, the indirect branch should be better. If it's 100% predictable, the direct branch is better.
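For context, a toy illustration (not Velox code; all names hypothetical) of the two dispatch styles being compared. The switch compiles to direct branches whose pattern the branch predictor can learn; the virtual call compiles to one indirect branch per invocation.

```cpp
#include <cstdint>

enum class Kind { kBigint, kVarchar };

// Style 1: switch on a runtime type tag inside the hot loop.
// The taken/not-taken pattern is fixed by the schema, so the BPU
// can predict it well.
uint64_t hashSwitch(Kind kind, uint64_t raw, uint64_t seed) {
  switch (kind) {
    case Kind::kBigint:
      return seed ^ (raw * 0x9E3779B97F4A7C15ULL);
    case Kind::kVarchar:
      return seed ^ (raw + 0xC2B2AE3D27D4EB4FULL);
  }
  return seed;
}

// Style 2: virtual dispatch. The compiler emits a vtable load
// followed by an indirect call.
struct Hasher {
  virtual ~Hasher() = default;
  virtual uint64_t hash(uint64_t raw, uint64_t seed) const = 0;
};

struct BigintHasher : Hasher {
  uint64_t hash(uint64_t raw, uint64_t seed) const override {
    return seed ^ (raw * 0x9E3779B97F4A7C15ULL);
  }
};
```

Both styles compute the same value for the same type; the difference is purely in how the branch predictor and front end handle the dispatch.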

FelixYBW avatar May 09 '24 23:05 FelixYBW

I didn't go through your code, but if I understand correctly, it's a comparison between indirect branch prediction and direct branch prediction.

If the branch target is 100% random, the indirect branch should be better. If it's 100% predictable, the direct branch is better.

@marin-ma just to be sure, let's collect some PMU events to confirm.

FelixYBW avatar May 09 '24 23:05 FelixYBW

Would you confirm that these results are from running the benchmark in 'release' mode?

@mbasmanova Yes, in 'release' mode.

marin-ma avatar May 10 '24 00:05 marin-ma

@mbasmanova, @marin-ma collected the data below. The conclusion is to use the virtual function call.

The virtual function call has 15% better performance. The root cause is that it executes 20% fewer instructions. As expected, the switch statement has more branch instructions. But in both cases branch misprediction is not an issue; the miss ratio is 2.8%. Compared to the virtual function call, the switch method doesn't show higher branch misprediction because the switch inside the function follows a fixed pattern that comes from the schema. The schema can be very complex, but the BPU handles it well as long as the pattern is fixed. The misprediction ratio may increase for a really complex schema that exceeds the BPU's history tracking buffer, which is currently on the order of 1000 entries.

| per second | virtual function call | switch |
|---|---|---|
| INST_RETIRED.ANY | 7,850,078,349 | 8,586,837,809 |
| CPU_CLK_UNHALTED.REF_TSC | 2,600,825,704 | 2,600,865,762 |
| CPU_CLK_UNHALTED.THREAD | 3,094,305,687 | 3,094,298,604 |
| CYCLE_ACTIVITY.STALLS_L3_MISS | 694,925 | 915,935 |
| LONGEST_LAT_CACHE.MISS | 261,666 | 340,501 |
| MEM_LOAD_RETIRED.L3_MISS | 14,705 | 17,211 |
| BR_INST_RETIRED.ALL_BRANCHES | 1,087,985,060 | 1,400,282,806 |
| BR_MISP_RETIRED.ALL_BRANCHES | 30,420,663 | 27,806,660 |
| BR_INST_RETIRED.INDIRECT | 93,107,159 | 39,964,183 |
| BR_MISP_RETIRED.INDIRECT | 1,900 | 145,261 |

|  | virtual | switch | virtual/switch |
|---|---|---|---|
| Instruction per loop | 14,714 | 18,584 | 0.792 |
| IPC | 2.54 | 2.78 |  |
| loop per second | 533,513 | 462,064 | 1.155 |
| branch misprediction ratio | 2.8% | 2.0% |  |
| branch misprediction/1K inst | 3.88 | 3.24 |  |
| branch per loop | 2,039 | 3,030 |  |

FelixYBW avatar May 10 '24 17:05 FelixYBW

@marin-ma Let's use the virtual function approach for now. Can you update the PR?

FelixYBW avatar May 11 '24 04:05 FelixYBW

@mbasmanova Reverted to the virtual function implementation. It looks like the unit test failure is not caused by this change. Could you help to review again? Thanks!

marin-ma avatar May 11 '24 04:05 marin-ma

|  | virtual | switch | virtual/switch |
|---|---|---|---|
| Instruction per loop | 14,714 | 18,584 | 0.792 |
| IPC | 2.54 | 2.78 |  |
| loop per second | 533,513 | 462,064 | 1.155 |
| branch misprediction ratio | 2.8% | 2.0% |  |
| branch misprediction/1K inst | 3.88 | 3.24 |  |
| branch per loop | 2,039 | 3,030 |  |

Adding one more metric, branch mispredictions per loop: virtual function call 57, switch 60.

FelixYBW avatar May 13 '24 02:05 FelixYBW

@mbasmanova @pedroerp Can you help to review again? The function is key for Gluten to support complex data types in shuffle.

FelixYBW avatar May 15 '24 05:05 FelixYBW

@FelixYBW @marin-ma Folks, I used your benchmark to compare performance before and after this PR for primitive types. I'm seeing a significant regression.

Before:

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
hash_BIGINT##hash                                         585.43us     1.71K
hash_BIGINT##xxhash64                                     488.31us     2.05K
----------------------------------------------------------------------------
hash_INTEGER##hash                                        488.94us     2.05K
hash_INTEGER##xxhash64                                    471.50us     2.12K
hash_VARCHAR##hash                                          2.21ms    452.07
hash_VARCHAR##xxhash64                                      1.97ms    508.80

After:

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
hash_BIGINT##hash                                         883.61us     1.13K
hash_BIGINT##xxhash64                                     927.77us     1.08K
----------------------------------------------------------------------------
hash_INTEGER##hash                                        808.87us     1.24K
hash_INTEGER##xxhash64                                    815.35us     1.23K
hash_VARCHAR##hash                                          2.67ms    374.56
hash_VARCHAR##xxhash64                                      2.49ms    402.09

mbasmanova avatar May 15 '24 15:05 mbasmanova

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
hash_BIGINT##hash                                         883.61us     1.13K
hash_BIGINT##xxhash64                                     927.77us     1.08K
----------------------------------------------------------------------------
hash_INTEGER##hash                                        808.87us     1.24K
hash_INTEGER##xxhash64                                    815.35us     1.23K
hash_VARCHAR##hash                                          2.67ms    374.56
hash_VARCHAR##xxhash64                                      2.49ms    402.09

It's expected: the old solution just inlined the hash function into the loop, which is only 18 cycles for uint64_t input. The new solution uses a virtual function call and possibly one more direct function call. Any extra cycle in such a tight loop matters.

To optimize this, we could treat primitive types differently from complex types, which adds code complexity. @marin-ma Let's check whether it's possible to replace the virtual function call with a simpler indirect function call.
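A hypothetical sketch of that idea (not Velox code; `HashFn`, `hashBigint`, and `hashColumn` are made-up names): resolve a plain function pointer per column once, outside the loop, so primitive types pay one indirect call per element with no vtable load and no per-element type switch.

```cpp
#include <cstdint>
#include <vector>

// A raw function pointer: one indirect call, no vtable indirection.
using HashFn = uint64_t (*)(uint64_t raw, uint64_t seed);

// Toy per-type hash for the primitive fast path (stand-in mixer,
// not real Murmur3/xxHash64).
uint64_t hashBigint(uint64_t raw, uint64_t seed) {
  return seed ^ (raw * 0x9E3779B97F4A7C15ULL);
}

// The function pointer is chosen once per column from the schema,
// then the tight loop runs with the resolved pointer.
std::vector<uint64_t> hashColumn(
    HashFn fn, const std::vector<uint64_t>& data, uint64_t seed) {
  std::vector<uint64_t> out;
  out.reserve(data.size());
  for (uint64_t raw : data) {
    out.push_back(fn(raw, seed));
  }
  return out;
}
```

Since the pointer target is constant across the whole column, the indirect branch is perfectly predictable, and for a monomorphic call site the compiler may even devirtualize or inline it.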

If the data size increases and cache misses grow, the gap will shrink because memory latency will hide the extra cycles.

Currently the hash calculation time is very small compared to split, compression, and shuffle write in Gluten, so we may not need to put more effort into this optimization.

FelixYBW avatar May 15 '24 19:05 FelixYBW

@marin-ma Did you test TPC-H in Gluten using this PR? If not, let's run a test and see what the perf loss is.

FelixYBW avatar May 15 '24 19:05 FelixYBW