doris [feature](datalake) Add BucketShuffleJoin support for bucketed hive tables

Add BucketShuffleJoin support for bucketed hive tables generated by Spark. (27783)

Proposed changes

Issue Number: close #27783

1. Original planner updated to consider BucketShuffle for bucketed hive table
2. Neerids planner updated for bucketShuffle join on hive tables.
3. Added spark style hash calculation in BE for shuffle on one side.

###Sample Output:s NeredisPlanner OldPlanner

Nov 29 '23 17:11 Nitin-Kashyap

clang-tidy review says "All clean, LGTM! :+1:"

Nov 30 '23 03:11 github-actions[bot]

clang-tidy review says "All clean, LGTM! :+1:"

Nov 30 '23 05:11 github-actions[bot]

Hi @Nitin-Kashyap , thanks for your contribution. Could you please provide some create table stmt of hive table on spark side, so that we can test this case?

Dec 01 '23 01:12 morningman

BTW, is it only suitable for "spark created" hive bucket table? What if the hive table is created by other system with different hash function?

Dec 01 '23 01:12 morningman

Hi @Nitin-Kashyap , thanks for your contribution. Could you please provide some create table stmt of hive table on spark side, so that we can test this case?

@morningman Please find the sample test I used for this case: -

CREATE TABLE parquet_test (
     user_id INT,
     key       VARCHAR(20),
     part      VARCAHAR(10)
)
USING parquet
PARTITIONED BY (part)
CLUSTERED BY (user_id) INTO 3 BUCKETS;

INSERT INTO parquet_test2 VALUES (31, 'U31', 'IN'),  (11,'U11','IN'), (21, 'U21', 'IN');

Dec 01 '23 04:12 Nitin-Kashyap

BTW, is it only suitable for "spark created" hive bucket table? What if the hive table is created by other system with different hash function?

@morningman Yes, for current scope it will understand only Spark created bucketed table, it identifies this by Properties defined by spark for bucket specification.

I plan to take up supporting for Hive, Hudi as well in some time (hopefully in next PR); for this I have left a place holder THashType [HIVE_MOD: Hive and Hudi use the same hash method] however for hudi some more changes on FE side need to do for identifing type bucket id from file path.

Dec 01 '23 04:12 Nitin-Kashyap

clang-tidy review says "All clean, LGTM! :+1:"

Dec 02 '23 12:12 github-actions[bot]

clang-tidy review says "All clean, LGTM! :+1:"

Dec 02 '23 12:12 github-actions[bot]

clang-tidy review says "All clean, LGTM! :+1:"

Dec 04 '23 05:12 github-actions[bot]

clang-tidy review says "All clean, LGTM! :+1:"

Dec 04 '23 18:12 github-actions[bot]

clang-tidy review says "All clean, LGTM! :+1:"

Dec 06 '23 08:12 github-actions[bot]

clang-tidy review says "All clean, LGTM! :+1:"

Dec 07 '23 15:12 github-actions[bot]

clang-tidy review says "All clean, LGTM! :+1:"

Dec 12 '23 11:12 github-actions[bot]

Sorry for the late review, I will keep on to it.

Dec 19 '23 07:12 morningman

clang-tidy review says "All clean, LGTM! :+1:"

Dec 21 '23 12:12 github-actions[bot]

clang-tidy review says "All clean, LGTM! :+1:"

Dec 21 '23 12:12 github-actions[bot]

clang-tidy review says "All clean, LGTM! :+1:"

Jan 04 '24 21:01 github-actions[bot]

clang-tidy review says "All clean, LGTM! :+1:"

Jan 09 '24 05:01 github-actions[bot]

run buildall

Jan 09 '24 05:01 Nitin-Kashyap

TeamCity be ut coverage result: Function Coverage: 36.61% (8638/23593) Line Coverage: 28.63% (70200/245210) Region Coverage: 27.57% (36303/131686) Branch Coverage: 24.26% (18546/76462) Coverage Report: http://coverage.selectdb-in.cc/coverage/75b776c0ed94c30019159023d6dac51c658c5078_75b776c0ed94c30019159023d6dac51c658c5078/report/index.html

Jan 09 '24 07:01 doris-robot

clang-tidy review says "All clean, LGTM! :+1:"

Jan 09 '24 10:01 github-actions[bot]

clang-tidy review says "All clean, LGTM! :+1:"

Jan 18 '24 22:01 github-actions[bot]

run buildall

Jan 19 '24 04:01 Nitin-Kashyap

TPC-H: Total hot run time: 38764 ms

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f4c53d41c90ac80307af5a3baefc340a7cb87500, data reload: false

------ Round 1 ----------------------------------
q1	17656	5210	5317	5210
q2	2048	148	137	137
q3	10628	1111	1126	1111
q4	10218	840	836	836
q5	7762	3114	3142	3114
q6	193	120	119	119
q7	874	488	477	477
q8	9238	1964	1897	1897
q9	7239	6378	6347	6347
q10	8213	3011	3035	3011
q11	418	201	204	201
q12	360	184	188	184
q13	18002	3338	3323	3323
q14	248	209	209	209
q15	538	507	499	499
q16	427	409	362	362
q17	935	492	560	492
q18	7590	7162	6849	6849
q19	1576	1454	1356	1356
q20	588	293	313	293
q21	2837	2440	2505	2440
q22	362	325	297	297
Total cold run time: 107950 ms
Total hot run time: 38764 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5384	5230	5316	5230
q2	323	218	212	212
q3	3331	3250	3226	3226
q4	2041	2055	2050	2050
q5	5920	5940	5902	5902
q6	200	115	116	115
q7	2293	1920	1891	1891
q8	3205	3361	3376	3361
q9	9024	8777	8765	8765
q10	3955	3817	3787	3787
q11	555	456	442	442
q12	799	604	623	604
q13	16930	3173	3125	3125
q14	285	261	255	255
q15	562	498	502	498
q16	524	473	468	468
q17	2695	1778	1838	1778
q18	9552	18705	9596	9596
q19	23407	1543	1520	1520
q20	4632	1940	1931	1931
q21	14432	5391	5339	5339
q22	999	544	569	544
Total cold run time: 111048 ms
Total hot run time: 60639 ms

Jan 19 '24 06:01 doris-robot

TPC-DS: Total hot run time: 176929 ms

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f4c53d41c90ac80307af5a3baefc340a7cb87500, data reload: false

query1	934	336	332	332
query2	6572	1897	1814	1814
query3	6694	207	209	207
query4	30217	22280	22283	22280
query5	6902	600	600	600
query6	240	151	149	149
query7	4602	261	266	261
query8	217	172	172	172
query9	8332	2626	2613	2613
query10	437	219	236	219
query11	16727	15502	15556	15502
query12	127	68	68	68
query13	1687	392	372	372
query14	10507	6966	6879	6879
query15	201	190	183	183
query16	4597	234	225	225
query17	997	473	467	467
query18	1799	261	255	255
query19	162	125	125	125
query20	71	69	67	67
query21	194	126	120	120
query22	4845	4819	4803	4803
query23	31426	30960	30914	30914
query24	11895	2863	2797	2797
query25	567	302	312	302
query26	1606	144	146	144
query27	3220	263	275	263
query28	7431	1920	1895	1895
query29	1391	624	653	624
query30	282	133	135	133
query31	923	743	756	743
query32	89	50	50	50
query33	700	212	206	206
query34	1134	458	457	457
query35	882	732	728	728
query36	1324	1171	1131	1131
query37	92	59	57	57
query38	3314	3229	3236	3229
query39	1310	1258	1254	1254
query40	203	86	83	83
query41	39	34	34	34
query42	96	82	87	82
query43	534	456	488	456
query44	1099	699	700	699
query45	189	184	170	170
query46	1062	654	654	654
query47	1661	1524	1575	1524
query48	396	317	324	317
query49	1135	278	285	278
query50	672	322	313	313
query51	5306	5248	5287	5248
query52	100	73	76	73
query53	322	258	257	257
query54	419	287	277	277
query55	84	74	76	74
query56	178	160	156	156
query57	997	930	928	928
query58	185	155	159	155
query59	2857	2700	2710	2700
query60	207	184	184	184
query61	86	79	79	79
query62	617	356	363	356
query63	273	256	254	254
query64	5034	1740	1755	1740
query65	3345	3255	3234	3234
query66	1276	319	311	311
query67	15696	14999	15031	14999
query68	13562	545	503	503
query69	605	299	294	294
query70	1769	1482	1421	1421
query71	484	231	217	217
query72	4946	2778	2799	2778
query73	3391	331	327	327
query74	6932	6426	6436	6426
query75	4897	2332	2318	2318
query76	6198	1022	1048	1022
query77	692	225	245	225
query78	9145	8788	8535	8535
query79	1697	517	497	497
query80	549	320	305	305
query81	464	205	203	203
query82	213	84	77	77
query83	144	125	120	120
query84	281	69	69	69
query85	1086	327	313	313
query86	396	378	389	378
query87	3547	3314	3377	3314
query88	3032	2241	2245	2241
query89	442	363	358	358
query90	1950	189	192	189
query91	143	125	128	125
query92	53	43	43	43
query93	1574	471	431	431
query94	1153	156	155	155
query95	505	462	440	440
query96	627	325	324	324
query97	4263	4141	4128	4128
query98	214	190	189	189
query99	1027	730	689	689
Total cold run time: 294696 ms
Total hot run time: 176929 ms

Jan 19 '24 06:01 doris-robot

ClickBench: Total hot run time: 30.2 s

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f4c53d41c90ac80307af5a3baefc340a7cb87500, data reload: false

query1	0.03	0.03	0.03
query2	0.06	0.02	0.02
query3	0.23	0.05	0.05
query4	1.69	0.07	0.07
query5	0.53	0.52	0.51
query6	1.32	0.64	0.61
query7	0.02	0.01	0.01
query8	0.04	0.03	0.03
query9	0.55	0.51	0.49
query10	0.55	0.56	0.57
query11	0.12	0.09	0.09
query12	0.11	0.10	0.09
query13	0.61	0.60	0.60
query14	0.77	0.82	0.81
query15	0.79	0.79	0.78
query16	0.38	0.37	0.39
query17	1.02	1.01	1.02
query18	0.23	0.26	0.25
query19	1.83	1.83	1.79
query20	0.01	0.01	0.01
query21	15.41	0.59	0.59
query22	2.73	2.29	1.31
query23	17.47	0.94	0.70
query24	2.48	1.05	0.91
query25	0.46	0.16	0.16
query26	0.47	0.12	0.13
query27	0.07	0.04	0.06
query28	11.40	0.76	0.74
query29	12.52	3.08	3.19
query30	0.58	0.50	0.50
query31	2.78	0.34	0.34
query32	3.36	0.48	0.48
query33	3.21	3.27	3.23
query34	15.87	4.31	4.27
query35	4.35	4.22	4.30
query36	1.12	1.07	1.08
query37	0.06	0.04	0.04
query38	0.04	0.03	0.03
query39	0.02	0.02	0.02
query40	0.15	0.13	0.13
query41	0.07	0.01	0.01
query42	0.02	0.01	0.02
query43	0.03	0.02	0.02
Total cold run time: 105.56 s
Total hot run time: 30.2 s

Jan 19 '24 06:01 doris-robot

TeamCity be ut coverage result: Function Coverage: 36.67% (8655/23601) Line Coverage: 28.72% (70673/246104) Region Coverage: 27.59% (36451/132119) Branch Coverage: 24.31% (18652/76718) Coverage Report: http://coverage.selectdb-in.cc/coverage/f4c53d41c90ac80307af5a3baefc340a7cb87500_f4c53d41c90ac80307af5a3baefc340a7cb87500/report/index.html

Jan 19 '24 06:01 doris-robot

Load test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'

Load test result on commit f4c53d41c90ac80307af5a3baefc340a7cb87500 with default session variables
Stream load json:         19 seconds loaded 2358488459 Bytes, about 118 MB/s
Stream load orc:          58 seconds loaded 1101869774 Bytes, about 18 MB/s
Stream load parquet:      31 seconds loaded 861443392 Bytes, about 26 MB/s
Insert into select:       14.6 seconds inserted 10000000 Rows, about 684K ops/s

Jan 19 '24 06:01 doris-robot

clang-tidy review says "All clean, LGTM! :+1:"

Jan 19 '24 19:01 github-actions[bot]

run buildall

Jan 20 '24 07:01 Nitin-Kashyap

doris doris copied to clipboard

[feature](datalake) Add BucketShuffleJoin support for bucketed hive tables

Proposed changes

doris
doris copied to clipboard