doris
doris copied to clipboard
[feature](datalake) Add BucketShuffleJoin support for bucketed hive tables
Add BucketShuffleJoin support for bucketed hive tables generated by Spark. (27783)
Proposed changes
Issue Number: close #27783
1. Original planner updated to consider BucketShuffle for bucketed hive table
2. Neerids planner updated for bucketShuffle join on hive tables.
3. Added spark style hash calculation in BE for shuffle on one side.
###Sample Output:s
clang-tidy review says "All clean, LGTM! :+1:"
clang-tidy review says "All clean, LGTM! :+1:"
Hi @Nitin-Kashyap , thanks for your contribution.
Could you please provide some create table
stmt of hive table on spark side,
so that we can test this case?
BTW, is it only suitable for "spark created" hive bucket table? What if the hive table is created by other system with different hash function?
Hi @Nitin-Kashyap , thanks for your contribution. Could you please provide some
create table
stmt of hive table on spark side, so that we can test this case?
@morningman Please find the sample test I used for this case: -
CREATE TABLE parquet_test (
user_id INT,
key VARCHAR(20),
part VARCAHAR(10)
)
USING parquet
PARTITIONED BY (part)
CLUSTERED BY (user_id) INTO 3 BUCKETS;
INSERT INTO parquet_test2 VALUES (31, 'U31', 'IN'), (11,'U11','IN'), (21, 'U21', 'IN');
BTW, is it only suitable for "spark created" hive bucket table? What if the hive table is created by other system with different hash function?
@morningman Yes, for current scope it will understand only Spark created bucketed table, it identifies this by Properties defined by spark for bucket specification.
I plan to take up supporting for Hive, Hudi as well in some time (hopefully in next PR); for this I have left a place holder THashType [HIVE_MOD: Hive and Hudi use the same hash method] however for hudi some more changes on FE side need to do for identifing type bucket id from file path.
clang-tidy review says "All clean, LGTM! :+1:"
clang-tidy review says "All clean, LGTM! :+1:"
clang-tidy review says "All clean, LGTM! :+1:"
clang-tidy review says "All clean, LGTM! :+1:"
clang-tidy review says "All clean, LGTM! :+1:"
clang-tidy review says "All clean, LGTM! :+1:"
clang-tidy review says "All clean, LGTM! :+1:"
Sorry for the late review, I will keep on to it.
clang-tidy review says "All clean, LGTM! :+1:"
clang-tidy review says "All clean, LGTM! :+1:"
clang-tidy review says "All clean, LGTM! :+1:"
clang-tidy review says "All clean, LGTM! :+1:"
run buildall
TeamCity be ut coverage result: Function Coverage: 36.61% (8638/23593) Line Coverage: 28.63% (70200/245210) Region Coverage: 27.57% (36303/131686) Branch Coverage: 24.26% (18546/76462) Coverage Report: http://coverage.selectdb-in.cc/coverage/75b776c0ed94c30019159023d6dac51c658c5078_75b776c0ed94c30019159023d6dac51c658c5078/report/index.html
clang-tidy review says "All clean, LGTM! :+1:"
clang-tidy review says "All clean, LGTM! :+1:"
run buildall
TPC-H: Total hot run time: 38764 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f4c53d41c90ac80307af5a3baefc340a7cb87500, data reload: false
------ Round 1 ----------------------------------
q1 17656 5210 5317 5210
q2 2048 148 137 137
q3 10628 1111 1126 1111
q4 10218 840 836 836
q5 7762 3114 3142 3114
q6 193 120 119 119
q7 874 488 477 477
q8 9238 1964 1897 1897
q9 7239 6378 6347 6347
q10 8213 3011 3035 3011
q11 418 201 204 201
q12 360 184 188 184
q13 18002 3338 3323 3323
q14 248 209 209 209
q15 538 507 499 499
q16 427 409 362 362
q17 935 492 560 492
q18 7590 7162 6849 6849
q19 1576 1454 1356 1356
q20 588 293 313 293
q21 2837 2440 2505 2440
q22 362 325 297 297
Total cold run time: 107950 ms
Total hot run time: 38764 ms
----- Round 2, with runtime_filter_mode=off -----
q1 5384 5230 5316 5230
q2 323 218 212 212
q3 3331 3250 3226 3226
q4 2041 2055 2050 2050
q5 5920 5940 5902 5902
q6 200 115 116 115
q7 2293 1920 1891 1891
q8 3205 3361 3376 3361
q9 9024 8777 8765 8765
q10 3955 3817 3787 3787
q11 555 456 442 442
q12 799 604 623 604
q13 16930 3173 3125 3125
q14 285 261 255 255
q15 562 498 502 498
q16 524 473 468 468
q17 2695 1778 1838 1778
q18 9552 18705 9596 9596
q19 23407 1543 1520 1520
q20 4632 1940 1931 1931
q21 14432 5391 5339 5339
q22 999 544 569 544
Total cold run time: 111048 ms
Total hot run time: 60639 ms
TPC-DS: Total hot run time: 176929 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f4c53d41c90ac80307af5a3baefc340a7cb87500, data reload: false
query1 934 336 332 332
query2 6572 1897 1814 1814
query3 6694 207 209 207
query4 30217 22280 22283 22280
query5 6902 600 600 600
query6 240 151 149 149
query7 4602 261 266 261
query8 217 172 172 172
query9 8332 2626 2613 2613
query10 437 219 236 219
query11 16727 15502 15556 15502
query12 127 68 68 68
query13 1687 392 372 372
query14 10507 6966 6879 6879
query15 201 190 183 183
query16 4597 234 225 225
query17 997 473 467 467
query18 1799 261 255 255
query19 162 125 125 125
query20 71 69 67 67
query21 194 126 120 120
query22 4845 4819 4803 4803
query23 31426 30960 30914 30914
query24 11895 2863 2797 2797
query25 567 302 312 302
query26 1606 144 146 144
query27 3220 263 275 263
query28 7431 1920 1895 1895
query29 1391 624 653 624
query30 282 133 135 133
query31 923 743 756 743
query32 89 50 50 50
query33 700 212 206 206
query34 1134 458 457 457
query35 882 732 728 728
query36 1324 1171 1131 1131
query37 92 59 57 57
query38 3314 3229 3236 3229
query39 1310 1258 1254 1254
query40 203 86 83 83
query41 39 34 34 34
query42 96 82 87 82
query43 534 456 488 456
query44 1099 699 700 699
query45 189 184 170 170
query46 1062 654 654 654
query47 1661 1524 1575 1524
query48 396 317 324 317
query49 1135 278 285 278
query50 672 322 313 313
query51 5306 5248 5287 5248
query52 100 73 76 73
query53 322 258 257 257
query54 419 287 277 277
query55 84 74 76 74
query56 178 160 156 156
query57 997 930 928 928
query58 185 155 159 155
query59 2857 2700 2710 2700
query60 207 184 184 184
query61 86 79 79 79
query62 617 356 363 356
query63 273 256 254 254
query64 5034 1740 1755 1740
query65 3345 3255 3234 3234
query66 1276 319 311 311
query67 15696 14999 15031 14999
query68 13562 545 503 503
query69 605 299 294 294
query70 1769 1482 1421 1421
query71 484 231 217 217
query72 4946 2778 2799 2778
query73 3391 331 327 327
query74 6932 6426 6436 6426
query75 4897 2332 2318 2318
query76 6198 1022 1048 1022
query77 692 225 245 225
query78 9145 8788 8535 8535
query79 1697 517 497 497
query80 549 320 305 305
query81 464 205 203 203
query82 213 84 77 77
query83 144 125 120 120
query84 281 69 69 69
query85 1086 327 313 313
query86 396 378 389 378
query87 3547 3314 3377 3314
query88 3032 2241 2245 2241
query89 442 363 358 358
query90 1950 189 192 189
query91 143 125 128 125
query92 53 43 43 43
query93 1574 471 431 431
query94 1153 156 155 155
query95 505 462 440 440
query96 627 325 324 324
query97 4263 4141 4128 4128
query98 214 190 189 189
query99 1027 730 689 689
Total cold run time: 294696 ms
Total hot run time: 176929 ms
ClickBench: Total hot run time: 30.2 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f4c53d41c90ac80307af5a3baefc340a7cb87500, data reload: false
query1 0.03 0.03 0.03
query2 0.06 0.02 0.02
query3 0.23 0.05 0.05
query4 1.69 0.07 0.07
query5 0.53 0.52 0.51
query6 1.32 0.64 0.61
query7 0.02 0.01 0.01
query8 0.04 0.03 0.03
query9 0.55 0.51 0.49
query10 0.55 0.56 0.57
query11 0.12 0.09 0.09
query12 0.11 0.10 0.09
query13 0.61 0.60 0.60
query14 0.77 0.82 0.81
query15 0.79 0.79 0.78
query16 0.38 0.37 0.39
query17 1.02 1.01 1.02
query18 0.23 0.26 0.25
query19 1.83 1.83 1.79
query20 0.01 0.01 0.01
query21 15.41 0.59 0.59
query22 2.73 2.29 1.31
query23 17.47 0.94 0.70
query24 2.48 1.05 0.91
query25 0.46 0.16 0.16
query26 0.47 0.12 0.13
query27 0.07 0.04 0.06
query28 11.40 0.76 0.74
query29 12.52 3.08 3.19
query30 0.58 0.50 0.50
query31 2.78 0.34 0.34
query32 3.36 0.48 0.48
query33 3.21 3.27 3.23
query34 15.87 4.31 4.27
query35 4.35 4.22 4.30
query36 1.12 1.07 1.08
query37 0.06 0.04 0.04
query38 0.04 0.03 0.03
query39 0.02 0.02 0.02
query40 0.15 0.13 0.13
query41 0.07 0.01 0.01
query42 0.02 0.01 0.02
query43 0.03 0.02 0.02
Total cold run time: 105.56 s
Total hot run time: 30.2 s
TeamCity be ut coverage result: Function Coverage: 36.67% (8655/23601) Line Coverage: 28.72% (70673/246104) Region Coverage: 27.59% (36451/132119) Branch Coverage: 24.31% (18652/76718) Coverage Report: http://coverage.selectdb-in.cc/coverage/f4c53d41c90ac80307af5a3baefc340a7cb87500_f4c53d41c90ac80307af5a3baefc340a7cb87500/report/index.html
Load test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
Load test result on commit f4c53d41c90ac80307af5a3baefc340a7cb87500 with default session variables
Stream load json: 19 seconds loaded 2358488459 Bytes, about 118 MB/s
Stream load orc: 58 seconds loaded 1101869774 Bytes, about 18 MB/s
Stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
Insert into select: 14.6 seconds inserted 10000000 Rows, about 684K ops/s
clang-tidy review says "All clean, LGTM! :+1:"
run buildall