
Support complex types in sparksql hash and xxhash64 function

Open marin-ma opened this issue 10 months ago • 26 comments

Currently, the sparksql hash functions only support primitive types. This patch adds implementations for complex types, including array, map, and row.

The expected results in the unit tests are obtained from Spark's output.

Spark's implementation https://github.com/apache/spark/blob/a2b7050e0fc5db6ac98db57309e4737acd26bf3a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L536-L609
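Spark's scheme for complex types chains seeds: the hash of one element becomes the seed for the next, so arrays, maps, and rows reduce to recursive calls over their children. A minimal sketch of that recursion is below; this is not Velox or Spark code, all names (`Value`, `hashValue`, `hashInt64`) are hypothetical, and the integer mixer is a toy stand-in, not real Murmur3 or xxHash64.

```cpp
#include <cstdint>
#include <variant>
#include <vector>

struct Value; // forward declaration so the variant can hold nested arrays
using Array = std::vector<Value>;

// A toy recursive value: either a primitive or an array of values.
struct Value {
  std::variant<int64_t, Array> v;
};

// Placeholder mixer standing in for a real Murmur3/xxHash64 step.
uint64_t hashInt64(int64_t x, uint64_t seed) {
  uint64_t h = seed ^ (static_cast<uint64_t>(x) * 0x9E3779B97F4A7C15ULL);
  h ^= h >> 32;
  return h;
}

// Seed chaining: each element's hash seeds the next, so element order
// matters and nesting is handled by plain recursion.
uint64_t hashValue(const Value& val, uint64_t seed) {
  if (const auto* i = std::get_if<int64_t>(&val.v)) {
    return hashInt64(*i, seed);
  }
  uint64_t h = seed;
  for (const auto& elem : std::get<Array>(val.v)) {
    h = hashValue(elem, h);
  }
  return h;
}
```

Because the seed is threaded through the loop, hashing `[1, 2]` and `[2, 1]` produces different results, matching the order sensitivity of Spark's implementation.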

marin-ma avatar Apr 09 '24 07:04 marin-ma


@mbasmanova Could you please help to review? Thanks!

marin-ma avatar Apr 09 '24 11:04 marin-ma

@rui-mo @PHILO-HE Could you help to review again? Thanks!

marin-ma avatar Apr 11 '24 03:04 marin-ma

@mbasmanova Could you help to review again? Thanks!

marin-ma avatar Apr 11 '24 08:04 marin-ma

Is it possible to restructure the code to avoid so many static functions?

jinchengchenghh avatar Apr 15 '24 01:04 jinchengchenghh

Please update the PR title to note that it supports complex types in the hash and xxhash64 functions.

jinchengchenghh avatar Apr 15 '24 01:04 jinchengchenghh

Is it possible to restructure the code to avoid so many static functions?

Curious, is there any downside to using static functions? The hash computations are stateless.

marin-ma avatar Apr 15 '24 01:04 marin-ma

Is it possible to restructure the code to avoid so many static functions?

Curious, is there any downside to using static functions? The hash computations are stateless.

A class where all of the functions are static looks strange to me, but the current implementation is also OK with me.

jinchengchenghh avatar Apr 15 '24 01:04 jinchengchenghh

@rui-mo @PHILO-HE @jinchengchenghh Do you have further comments? Thanks!

marin-ma avatar Apr 15 '24 03:04 marin-ma

LGTM

jinchengchenghh avatar Apr 15 '24 05:04 jinchengchenghh

@mbasmanova Could you help to review again? Thanks!

marin-ma avatar Apr 15 '24 09:04 marin-ma

@mbasmanova Any more comments? The function is used by Gluten columnar shuffle.

FelixYBW avatar Apr 19 '24 06:04 FelixYBW

@mbasmanova Could you help to review this patch? Thanks!

marin-ma avatar Apr 24 '24 08:04 marin-ma

@pedroerp Addressed comments. Could you help to review again? Thanks!

marin-ma avatar Apr 26 '24 07:04 marin-ma

@pedroerp Sorry that I was on leave for the past few days. Just addressed the comments. Could you help to review again? Thanks!

marin-ma avatar Apr 30 '24 01:04 marin-ma

@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Apr 30 '24 19:04 facebook-github-bot

@mbasmanova Could you help to review again? Thanks!

marin-ma avatar May 01 '24 03:05 marin-ma

@mbasmanova @pedroerp I created a benchmark in https://github.com/facebookincubator/velox/pull/9414/commits/9265b975152586ebdda5576314133a3c24120f53

Benchmark Result at commit https://github.com/facebookincubator/velox/pull/9414/commits/569500df1ffb26642c44233af75941561cd10c13

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
hash_ARRAY<MAP<INTEGER,VARCHAR>>##hash                     15.02ms     66.60
hash_ARRAY<MAP<INTEGER,VARCHAR>>##xxhash64                 14.27ms     70.05
hash_ROW<f_map:MAP<INTEGER,VARCHAR>,f_array:ARR            25.94ms     38.55
hash_ROW<f_map:MAP<INTEGER,VARCHAR>,f_array:ARR            25.99ms     38.48

Benchmark Result at commit https://github.com/facebookincubator/velox/pull/9414/commits/9265b975152586ebdda5576314133a3c24120f53 (latest)

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
hash_ARRAY<MAP<INTEGER,VARCHAR>>##hash                     17.76ms     56.31
hash_ARRAY<MAP<INTEGER,VARCHAR>>##xxhash64                 16.20ms     61.71
hash_ROW<f_map:MAP<INTEGER,VARCHAR>,f_array:ARR            28.24ms     35.41
hash_ROW<f_map:MAP<INTEGER,VARCHAR>,f_array:ARR            28.32ms     35.31

Seems like the previous implementation at https://github.com/facebookincubator/velox/pull/9414/commits/569500df1ffb26642c44233af75941561cd10c13 is better.

marin-ma avatar May 09 '24 12:05 marin-ma

I created a benchmark

@marin-ma Thanks. Would you confirm that these results are from running the benchmark in 'release' mode?

mbasmanova avatar May 09 '24 13:05 mbasmanova

@mbasmanova @pedroerp I created a benchmark in 9265b97 [...] Seems like the previous implementation at 569500d is better.

I didn't go through your code, but if I understand correctly, it's a comparison between indirect branch prediction and direct branch prediction.

If the branch target is 100% random, the indirect branch should be better. If it's 100% predictable, the direct branch is better.
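For context, a toy illustration (not Velox code; all names hypothetical) of the two dispatch styles being compared. The switch compiles to direct branches whose pattern the branch predictor can learn; the virtual call compiles to one indirect branch per invocation.

```cpp
#include <cstdint>

enum class Kind { kBigint, kVarchar };

// Style 1: switch on a runtime type tag inside the hot loop.
// The taken/not-taken pattern is fixed by the schema, so the BPU
// can predict it well.
uint64_t hashSwitch(Kind kind, uint64_t raw, uint64_t seed) {
  switch (kind) {
    case Kind::kBigint:
      return seed ^ (raw * 0x9E3779B97F4A7C15ULL);
    case Kind::kVarchar:
      return seed ^ (raw + 0xC2B2AE3D27D4EB4FULL);
  }
  return seed;
}

// Style 2: virtual dispatch. The compiler emits a vtable load
// followed by an indirect call.
struct Hasher {
  virtual ~Hasher() = default;
  virtual uint64_t hash(uint64_t raw, uint64_t seed) const = 0;
};

struct BigintHasher : Hasher {
  uint64_t hash(uint64_t raw, uint64_t seed) const override {
    return seed ^ (raw * 0x9E3779B97F4A7C15ULL);
  }
};
```

Both styles compute the same value for the same type; the difference is purely in how the branch predictor and front end handle the dispatch.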

FelixYBW avatar May 09 '24 23:05 FelixYBW

I didn't go through your code, but if I understand correctly, it's a comparison between indirect branch prediction and direct branch prediction.

If the branch target is 100% random, the indirect branch should be better. If it's 100% predictable, the direct branch is better.

@marin-ma just to be sure, let's collect some PMU events to confirm.

FelixYBW avatar May 09 '24 23:05 FelixYBW

Would you confirm that these results are from running the benchmark in 'release' mode?

@mbasmanova Yes, in 'release' mode.

marin-ma avatar May 10 '24 00:05 marin-ma

@mbasmanova, @marin-ma collected the data below. The conclusion is to use the virtual function call.

The virtual function call has 15% better performance. The root cause is that it executes 20% fewer instructions. As expected, the switch statement has more branch instructions. But in both cases branch misprediction is not an issue; the miss ratio is 2.8%. Compared to the virtual function call, the switch method doesn't show higher branch misprediction because the switch inside the function follows a fixed pattern that comes from the schema. The schema can be very complex, but the BPU handles it well as long as the pattern is fixed. The misprediction ratio may increase for a really complex schema that exceeds the BPU's history tracking buffer, which is currently on the order of 1000 entries.

| per second | virtual function call | switch |
|---|---|---|
| INST_RETIRED.ANY | 7,850,078,349 | 8,586,837,809 |
| CPU_CLK_UNHALTED.REF_TSC | 2,600,825,704 | 2,600,865,762 |
| CPU_CLK_UNHALTED.THREAD | 3,094,305,687 | 3,094,298,604 |
| CYCLE_ACTIVITY.STALLS_L3_MISS | 694,925 | 915,935 |
| LONGEST_LAT_CACHE.MISS | 261,666 | 340,501 |
| MEM_LOAD_RETIRED.L3_MISS | 14,705 | 17,211 |
| BR_INST_RETIRED.ALL_BRANCHES | 1,087,985,060 | 1,400,282,806 |
| BR_MISP_RETIRED.ALL_BRANCHES | 30,420,663 | 27,806,660 |
| BR_INST_RETIRED.INDIRECT | 93,107,159 | 39,964,183 |
| BR_MISP_RETIRED.INDIRECT | 1,900 | 145,261 |

|  | virtual | switch | virtual/switch |
|---|---|---|---|
| Instruction per loop | 14,714 | 18,584 | 0.792 |
| IPC | 2.54 | 2.78 |  |
| loop per second | 533,513 | 462,064 | 1.155 |
| branch misprediction ratio | 2.8% | 2.0% |  |
| branch misprediction/1K inst | 3.88 | 3.24 |  |
| branch per loop | 2,039 | 3,030 |  |

FelixYBW avatar May 10 '24 17:05 FelixYBW

@marin-ma Let's use the virtual function approach for now. Can you update the PR?

FelixYBW avatar May 11 '24 04:05 FelixYBW

@mbasmanova Reverted to the virtual function implementation. It looks like the unit test failure is not caused by this change. Could you help to review again? Thanks!

marin-ma avatar May 11 '24 04:05 marin-ma

|  | virtual | switch | virtual/switch |
|---|---|---|---|
| Instruction per loop | 14,714 | 18,584 | 0.792 |
| IPC | 2.54 | 2.78 |  |
| loop per second | 533,513 | 462,064 | 1.155 |
| branch misprediction ratio | 2.8% | 2.0% |  |
| branch misprediction/1K inst | 3.88 | 3.24 |  |
| branch per loop | 2,039 | 3,030 |  |

Adding one more metric, branch mispredictions per loop: virtual function call 57, switch 60.

FelixYBW avatar May 13 '24 02:05 FelixYBW

@mbasmanova @pedroerp Can you help to review again? The function is key for Gluten to support complex data types in shuffle.

FelixYBW avatar May 15 '24 05:05 FelixYBW

@FelixYBW @marin-ma Folks, I used your benchmark to compare performance before and after this PR for primitive types. I'm seeing a significant regression.

Before:

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
hash_BIGINT##hash                                         585.43us     1.71K
hash_BIGINT##xxhash64                                     488.31us     2.05K
----------------------------------------------------------------------------
hash_INTEGER##hash                                        488.94us     2.05K
hash_INTEGER##xxhash64                                    471.50us     2.12K
hash_VARCHAR##hash                                          2.21ms    452.07
hash_VARCHAR##xxhash64                                      1.97ms    508.80

After:

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
hash_BIGINT##hash                                         883.61us     1.13K
hash_BIGINT##xxhash64                                     927.77us     1.08K
----------------------------------------------------------------------------
hash_INTEGER##hash                                        808.87us     1.24K
hash_INTEGER##xxhash64                                    815.35us     1.23K
hash_VARCHAR##hash                                          2.67ms    374.56
hash_VARCHAR##xxhash64                                      2.49ms    402.09

mbasmanova avatar May 15 '24 15:05 mbasmanova

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
hash_BIGINT##hash                                         883.61us     1.13K
hash_BIGINT##xxhash64                                     927.77us     1.08K
----------------------------------------------------------------------------
hash_INTEGER##hash                                        808.87us     1.24K
hash_INTEGER##xxhash64                                    815.35us     1.23K
hash_VARCHAR##hash                                          2.67ms    374.56
hash_VARCHAR##xxhash64                                      2.49ms    402.09

It's expected: the old solution just inlined the hash function into the loop, which is only 18 cycles for uint64_t input. The new solution uses a virtual function call and possibly one more direct function call. Any extra cycle in such a tight loop matters.

To optimize this, we could treat primitive types differently from complex types, which adds code complexity. @marin-ma Let's check whether it's possible to replace the virtual function call with a simpler indirect function call.
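A hypothetical sketch of that idea (not Velox code; `HashFn`, `hashBigint`, and `hashColumn` are made-up names): resolve a plain function pointer per column once, outside the loop, so primitive types pay one indirect call per element with no vtable load and no per-element type switch.

```cpp
#include <cstdint>
#include <vector>

// A raw function pointer: one indirect call, no vtable indirection.
using HashFn = uint64_t (*)(uint64_t raw, uint64_t seed);

// Toy per-type hash for the primitive fast path (stand-in mixer,
// not real Murmur3/xxHash64).
uint64_t hashBigint(uint64_t raw, uint64_t seed) {
  return seed ^ (raw * 0x9E3779B97F4A7C15ULL);
}

// The function pointer is chosen once per column from the schema,
// then the tight loop runs with the resolved pointer.
std::vector<uint64_t> hashColumn(
    HashFn fn, const std::vector<uint64_t>& data, uint64_t seed) {
  std::vector<uint64_t> out;
  out.reserve(data.size());
  for (uint64_t raw : data) {
    out.push_back(fn(raw, seed));
  }
  return out;
}
```

Since the pointer target is constant across the whole column, the indirect branch is perfectly predictable, and for a monomorphic call site the compiler may even devirtualize or inline it.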

If the data size increases and cache misses grow, the gap will shrink because memory latency will hide the extra cycles.

Currently the hash calculation time is very small compared to split, compression, and shuffle write in Gluten, so we may not need to put more effort into this optimization.

FelixYBW avatar May 15 '24 19:05 FelixYBW

@marin-ma Did you test TPC-H in Gluten using this PR? If not, let's run a test and see what the perf loss is.

FelixYBW avatar May 15 '24 19:05 FelixYBW