[fix](cloud) Fix cloud warm up balance slow scheduling
What problem does this PR solve?
Currently, tablet warm-up balancing in the cloud executes warm-up tasks sequentially, one at a time, which causes several problems:
- When scaling out a compute group with additional BE nodes and a large number of tables (millions of tablets), tests showed that scaling from 1 BE node to 10 BE nodes took more than 6 hours to reach a balanced state. Each warm-up task RPC took about 30 ms. So even when the new nodes can handle the load, scaling out in the cloud can still take more than 6 hours in the worst case.
- For the same reason, decommissioning a BE is also relatively slow.
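As a rough sanity check on the figures above, the following sketch estimates how long a fully sequential migration would take. It assumes 1 million tablets (the description only says "millions"), that a balanced state means every BE holds an equal share, and that each tablet move costs exactly one 30 ms RPC. The function names are illustrative and do not come from the Doris code base.

```python
def tablets_to_move(total_tablets: int, old_nodes: int, new_nodes: int) -> int:
    """Tablets that must leave the old nodes so every node holds an equal share."""
    return total_tablets * (new_nodes - old_nodes) // new_nodes


def sequential_hours(moves: int, rpc_ms: int) -> float:
    """Wall-clock hours when the moves run strictly one after another."""
    return moves * rpc_ms / 3_600_000


# Assumed scenario: 1 million tablets, scaling from 1 BE to 10 BEs, 30 ms per RPC.
moves = tablets_to_move(1_000_000, old_nodes=1, new_nodes=10)
print(moves, sequential_hours(moves, rpc_ms=30))
```

Under these assumptions, 900,000 moves at 30 ms each come to 7.5 hours, which is in line with the observed "more than 6 hours".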
Fixes:
- Batch and pipeline warm-up tasks. Each batch can contain multiple warm-up tasks with the same source and destination (each task migrates one tablet).
- Run the warm-up task finish logic in a separate thread so that scheduling does not delay the logic that modifies tablet-to-BE mappings.
- Fetch file cache meta asynchronously in the `warm_up_cache_async` logic and add some bvars.
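The batching rule in the first fix (one batch holds tasks that share a source and a destination) can be illustrated with a short sketch. This is a Python illustration only, not the code in this PR: `WarmUpTask`, `build_batches`, and `max_batch_size` are hypothetical names, and the batch size cap is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class WarmUpTask:
    """Hypothetical task record: migrate one tablet from src_be to dest_be."""
    tablet_id: int
    src_be: int
    dest_be: int


def build_batches(tasks: List[WarmUpTask], max_batch_size: int) -> List[List[WarmUpTask]]:
    """Group tasks by (source, destination) and cut each group into bounded batches."""
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")

    # Tasks with the same (src_be, dest_be) pair can share one RPC.
    grouped: Dict[Tuple[int, int], List[WarmUpTask]] = defaultdict(list)
    for task in tasks:
        grouped[(task.src_be, task.dest_be)].append(task)

    batches: List[List[WarmUpTask]] = []
    for group in grouped.values():
        for start in range(0, len(group), max_batch_size):
            batches.append(group[start:start + max_batch_size])
    return batches


tasks = [
    WarmUpTask(1, 100, 200),
    WarmUpTask(2, 100, 201),
    WarmUpTask(3, 100, 200),
    WarmUpTask(4, 100, 200),
]
for batch in build_batches(tasks, max_batch_size=2):
    print([t.tablet_id for t in batch])
```

With batches like these, the number of RPCs scales with the number of batches rather than the number of tablets, which is where the speedup over one RPC per tablet would come from.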
Post-fix testing showed that in a scenario with 10 databases, 10,000 tables, 100,000 partitions, and 1 million tablets, scaling from 3 to 10 BE nodes completed within 10 minutes.
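The second fix listed above, moving task completion onto its own thread, can be sketched with a queue between the scheduler and a finisher thread. This is a minimal Python illustration under stated assumptions, not the implementation in this PR: the mapping is assumed to be a tablet-to-BE dictionary, and `run_finisher`, `demo`, and the `_STOP` sentinel are hypothetical names.

```python
import queue
import threading

_STOP = object()  # sentinel telling the finisher thread to exit


def run_finisher(finished: queue.Queue, tablet_to_be: dict, lock: threading.Lock) -> None:
    """Apply completed migrations to the mapping, independent of scheduling."""
    while True:
        item = finished.get()
        if item is _STOP:
            break
        tablet_id, dest_be = item
        with lock:
            tablet_to_be[tablet_id] = dest_be


def demo() -> dict:
    tablet_to_be = {1: 100, 2: 100, 3: 100}
    lock = threading.Lock()
    finished: queue.Queue = queue.Queue()

    finisher = threading.Thread(target=run_finisher, args=(finished, tablet_to_be, lock))
    finisher.start()

    # Scheduling side: hand completed tasks to the finisher and keep scheduling
    # without waiting for the mapping update.
    for tablet_id, dest_be in [(1, 200), (3, 201)]:
        finished.put((tablet_id, dest_be))

    finished.put(_STOP)
    finisher.join()
    return tablet_to_be


print(demo())
```

The design point is that the scheduling loop only enqueues results, so a slow scheduling round does not hold up mapping updates, and the reverse also holds.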
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
- Test
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [x] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [x] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason
- Behavior changed:
    - [x] No.
    - [ ] Yes.
- Does this need documentation?
    - [x] No.
    - [ ] Yes.
Check List (For Reviewer who merge this PR)
- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label
run buildall
BE UT Coverage Report
Increment line coverage 0.00% (0/12) :tada:
Increment coverage report Complete coverage report
| Category | Coverage |
|---|---|
| Function Coverage | 53.43% (18826/35238) |
| Line Coverage | 39.19% (174139/444394) |
| Region Coverage | 33.88% (135042/398632) |
| Branch Coverage | 34.78% (58048/166885) |
TPC-H: Total hot run time: 35190 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 88ab2db752b06c048f7b502e947f31961fa3a44e, data reload: false
------ Round 1 ----------------------------------
q1 17619 4153 4096 4096
q2 2042 362 258 258
q3 10162 1353 750 750
q4 10214 834 309 309
q5 7500 2136 1911 1911
q6 185 171 139 139
q7 1073 864 710 710
q8 9355 1464 1184 1184
q9 7088 5328 5319 5319
q10 6789 2391 1992 1992
q11 562 341 298 298
q12 664 724 581 581
q13 17795 3683 3117 3117
q14 295 292 274 274
q15 615 522 520 520
q16 691 680 641 641
q17 688 749 602 602
q18 7846 7150 7166 7150
q19 1101 970 598 598
q20 411 363 259 259
q21 4214 3977 3513 3513
q22 1072 989 969 969
Total cold run time: 107981 ms
Total hot run time: 35190 ms
----- Round 2, with runtime_filter_mode=off -----
q1 4099 4079 4067 4067
q2 316 406 328 328
q3 2147 2646 2306 2306
q4 1342 1722 1312 1312
q5 4254 4645 4738 4645
q6 238 170 129 129
q7 2031 1931 1874 1874
q8 2738 2586 2608 2586
q9 7669 7814 7453 7453
q10 3123 3232 2819 2819
q11 598 526 491 491
q12 681 737 648 648
q13 3765 3918 3305 3305
q14 320 349 274 274
q15 557 515 517 515
q16 646 672 626 626
q17 1202 1465 1463 1463
q18 7922 7770 7650 7650
q19 951 850 852 850
q20 2091 2056 1937 1937
q21 4928 4311 4161 4161
q22 1091 1044 988 988
Total cold run time: 52709 ms
Total hot run time: 50427 ms
TPC-DS: Total hot run time: 178254 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 88ab2db752b06c048f7b502e947f31961fa3a44e, data reload: false
query5 5783 599 448 448
query6 331 225 211 211
query7 4228 466 283 283
query8 310 247 232 232
query9 8765 2549 2580 2549
query10 552 395 330 330
query11 15691 15191 14531 14531
query12 184 124 117 117
query13 1250 491 389 389
query14 6954 3127 2824 2824
query14_1 2697 2684 2721 2684
query15 224 203 184 184
query16 894 483 454 454
query17 1134 697 566 566
query18 2688 432 335 335
query19 229 227 204 204
query20 122 115 113 113
query21 219 139 117 117
query22 4120 3986 3807 3807
query23 16594 16295 15867 15867
query23_1 15976 16045 15990 15990
query24 7159 1657 1250 1250
query24_1 1239 1248 1281 1248
query25 556 461 417 417
query26 1245 266 161 161
query27 2772 493 320 320
query28 4483 2136 2113 2113
query29 806 549 458 458
query30 319 247 217 217
query31 821 714 633 633
query32 81 70 73 70
query33 533 353 290 290
query34 925 909 531 531
query35 797 837 745 745
query36 867 897 817 817
query37 139 102 83 83
query38 2851 2894 2854 2854
query39 760 746 728 728
query39_1 694 711 694 694
query40 231 140 122 122
query41 66 64 64 64
query42 110 116 110 110
query43 424 443 407 407
query44 1364 761 747 747
query45 200 194 186 186
query46 875 981 625 625
query47 1653 1694 1614 1614
query48 320 332 265 265
query49 619 453 359 359
query50 665 291 217 217
query51 3826 3867 3835 3835
query52 115 115 106 106
query53 333 365 294 294
query54 289 258 255 255
query55 80 81 74 74
query56 305 309 295 295
query57 1143 1127 1076 1076
query58 264 261 269 261
query59 2297 2424 2371 2371
query60 320 308 291 291
query61 162 158 152 152
query62 687 659 628 628
query63 332 291 305 291
query64 4911 1300 1091 1091
query65 4014 3947 3931 3931
query66 1387 443 336 336
query67 15258 14789 14797 14789
query68 2756 1037 760 760
query69 470 365 323 323
query70 1049 1001 960 960
query71 340 311 278 278
query72 6023 5090 4977 4977
query73 477 541 303 303
query74 8826 8860 8576 8576
query75 3125 3177 2821 2821
query76 2831 1139 726 726
query77 358 411 313 313
query78 9539 9677 8858 8858
query79 2514 878 608 608
query80 1628 670 572 572
query81 597 268 235 235
query82 411 133 101 101
query83 365 251 238 238
query84 257 125 110 110
query85 964 512 455 455
query86 475 291 308 291
query87 3009 3063 2982 2982
query88 3320 2266 2278 2266
query89 474 423 392 392
query90 2063 161 154 154
query91 174 179 143 143
query92 76 69 65 65
query93 1385 901 551 551
query94 552 304 283 283
query95 551 327 359 327
query96 593 467 211 211
query97 2295 2352 2236 2236
query98 229 199 192 192
query99 1294 1285 1196 1196
Total cold run time: 257150 ms
Total hot run time: 178254 ms
ClickBench: Total hot run time: 27.52 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 88ab2db752b06c048f7b502e947f31961fa3a44e, data reload: false
query1 0.05 0.05 0.06
query2 0.10 0.04 0.05
query3 0.26 0.09 0.09
query4 1.61 0.12 0.11
query5 0.27 0.26 0.27
query6 1.16 0.65 0.63
query7 0.03 0.03 0.03
query8 0.05 0.04 0.04
query9 0.58 0.51 0.51
query10 0.56 0.56 0.56
query11 0.17 0.11 0.11
query12 0.15 0.12 0.12
query13 0.61 0.59 0.60
query14 1.00 0.98 0.96
query15 0.82 0.80 0.81
query16 0.42 0.42 0.41
query17 1.03 1.08 1.10
query18 0.22 0.21 0.21
query19 1.98 1.77 1.87
query20 0.02 0.02 0.01
query21 15.44 0.29 0.14
query22 4.73 0.05 0.04
query23 15.98 0.28 0.10
query24 1.10 0.41 0.48
query25 0.08 0.08 0.08
query26 0.14 0.14 0.14
query27 0.06 0.04 0.05
query28 4.68 1.24 1.03
query29 12.58 3.97 3.19
query30 0.27 0.14 0.11
query31 2.81 0.62 0.40
query32 3.25 0.55 0.45
query33 2.96 3.05 3.16
query34 16.88 5.23 4.61
query35 4.57 4.51 4.59
query36 0.64 0.49 0.49
query37 0.11 0.07 0.06
query38 0.07 0.04 0.04
query39 0.05 0.04 0.04
query40 0.18 0.15 0.13
query41 0.09 0.03 0.02
query42 0.04 0.03 0.03
query43 0.03 0.03 0.04
Total cold run time: 97.83 s
Total hot run time: 27.52 s
BE UT Coverage Report
Increment line coverage 0.00% (0/12) :tada:
Increment coverage report Complete coverage report
| Category | Coverage |
|---|---|
| Function Coverage | 53.38% (18821/35258) |
| Line Coverage | 39.17% (174221/444816) |
| Region Coverage | 33.75% (134901/399660) |
| Branch Coverage | 34.64% (58081/167653) |
BE Regression && UT Coverage Report
Increment line coverage 10.38% (11/106) :tada:
Increment coverage report Complete coverage report
| Category | Coverage |
|---|---|
| Function Coverage | 72.21% (24950/34553) |
| Line Coverage | 58.96% (261980/444310) |
| Region Coverage | 53.87% (217782/404297) |
| Branch Coverage | 55.36% (93269/168466) |
FE Regression Coverage Report
Increment line coverage 32.47% (63/194) :tada:
Increment coverage report
Complete coverage report
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.