risingwave
chore(config): make arrangement backfill default
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Enable only after:
- [x] https://github.com/risingwavelabs/risingwave/pull/14842
- [x] https://github.com/risingwavelabs/risingwave/pull/14836
- [ ] ~And a streaming nexmark bench for main queries.~ Nexmark bench does not contain backfill. We should run backfill bench instead.
Now mainly testing it to make sure recent changes to arrangement backfill did not trigger any regressions.
Checklist
- [ ] I have written necessary rustdoc comments
- [ ] I have added necessary unit tests and integration tests
- [ ] I have added test labels as necessary. See details.
- [ ] I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features #7934).
- [ ] My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
- [ ] All checks passed in `./risedev check` (or alias, `./risedev c`)
- [ ] My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
- [ ] My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)
Documentation
- [ ] My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.
Need to figure out:
- Why e2e test will TLE. (Perhaps it's the `get_row` each time we persist state in debug mode.)
- Why deterministic test `test_high_barrier_latency_cancel` takes a long time.
- Why recovery test fails.
E2e test Runtime comparison (debug mode)
key: e2e_test/streaming/aggregate/hdr_approx_percentile.slt, diff: +1409, arrangement runtime: 10965, no shuffle runtime: 9556
key: e2e_test/streaming/./nexmark/create_views.slt.part, diff: +4248, arrangement runtime: 19953, no shuffle runtime: 15705
key: e2e_test/streaming/nexmark_snapshot.slt, diff: +5114, arrangement runtime: 22997, no shuffle runtime: 17883
key: e2e_test/streaming/nexmark_upstream.slt, diff: +5226, arrangement runtime: 23449, no shuffle runtime: 18223
key: e2e_test/streaming/over_window/main.slt, diff: +2053, arrangement runtime: 49867, no shuffle runtime: 47814
key: e2e_test/streaming/./tpch/./views/q5.slt.part, diff: +1287, arrangement runtime: 2045, no shuffle runtime: 758
key: e2e_test/streaming/./tpch/./views/q7.slt.part, diff: +1782, arrangement runtime: 2563, no shuffle runtime: 781
key: e2e_test/streaming/./tpch/./views/q8.slt.part, diff: +2189, arrangement runtime: 3556, no shuffle runtime: 1367
key: e2e_test/streaming/./tpch/./views/q9.slt.part, diff: +1991, arrangement runtime: 3312, no shuffle runtime: 1321
key: e2e_test/streaming/./tpch/./views/q10.slt.part, diff: +1858, arrangement runtime: 2875, no shuffle runtime: 1017
key: e2e_test/streaming/./tpch/./views/q11.slt.part, diff: +1844, arrangement runtime: 2828, no shuffle runtime: 984
key: e2e_test/streaming/./tpch/./views/q12.slt.part, diff: +1585, arrangement runtime: 2285, no shuffle runtime: 700
key: e2e_test/streaming/./tpch/./views/q13.slt.part, diff: +1560, arrangement runtime: 2252, no shuffle runtime: 692
key: e2e_test/streaming/./tpch/./views/q14.slt.part, diff: +1561, arrangement runtime: 2263, no shuffle runtime: 702
key: e2e_test/streaming/./tpch/./views/q15.slt.part, diff: +1864, arrangement runtime: 2614, no shuffle runtime: 750
key: e2e_test/streaming/./tpch/./views/q16.slt.part, diff: +2220, arrangement runtime: 3210, no shuffle runtime: 990
key: e2e_test/streaming/./tpch/./views/q17.slt.part, diff: +2165, arrangement runtime: 3464, no shuffle runtime: 1299
key: e2e_test/streaming/./tpch/./views/q18.slt.part, diff: +2399, arrangement runtime: 3713, no shuffle runtime: 1314
key: e2e_test/streaming/./tpch/./views/q19.slt.part, diff: +2096, arrangement runtime: 3086, no shuffle runtime: 990
key: e2e_test/streaming/./tpch/./views/q20.slt.part, diff: +3577, arrangement runtime: 5405, no shuffle runtime: 1828
key: e2e_test/streaming/./tpch/./views/q21.slt.part, diff: +3933, arrangement runtime: 5887, no shuffle runtime: 1954
key: e2e_test/streaming/./tpch/./views/q22.slt.part, diff: +2721, arrangement runtime: 4033, no shuffle runtime: 1312
key: e2e_test/streaming/./tpch/create_views.slt.part, diff: +38380, arrangement runtime: 60467, no shuffle runtime: 22087
key: e2e_test/streaming/./tpch/drop_views.slt.part, diff: +4169, arrangement runtime: 8156, no shuffle runtime: 3987
key: e2e_test/streaming/tpch_snapshot.slt, diff: +45753, arrangement runtime: 77609, no shuffle runtime: 31856
key: e2e_test/streaming/tpch_upstream.slt, diff: +44402, arrangement runtime: 77006, no shuffle runtime: 32604
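The lines above share one regular shape (`key`, `diff`, `arrangement runtime`, `no shuffle runtime`), so the relative slowdown per test file can be computed mechanically. The following is a minimal Python sketch, not a script from this PR: the names `parse_runtime_line` and `slowdown` are illustrative, and the three sample lines are copied from the list above. It makes no assumption about the time unit.

```python
import re

# Matches one line of the runtime comparison listed above.
LINE_RE = re.compile(
    r"key: (?P<key>\S+), diff: (?P<diff>[+-]?\d+), "
    r"arrangement runtime: (?P<arrangement>\d+), "
    r"no shuffle runtime: (?P<no_shuffle>\d+)"
)


def parse_runtime_line(line):
    """Parse one comparison line and attach the arrangement / no-shuffle ratio."""
    m = LINE_RE.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")
    arrangement = int(m["arrangement"])
    no_shuffle = int(m["no_shuffle"])
    return {
        "key": m["key"],
        "diff": int(m["diff"]),
        "arrangement": arrangement,
        "no_shuffle": no_shuffle,
        "slowdown": arrangement / no_shuffle,  # hypothetical field name
    }


# Three lines copied verbatim from the comparison above.
SAMPLE_LINES = [
    "key: e2e_test/streaming/aggregate/hdr_approx_percentile.slt, diff: +1409, arrangement runtime: 10965, no shuffle runtime: 9556",
    "key: e2e_test/streaming/over_window/main.slt, diff: +2053, arrangement runtime: 49867, no shuffle runtime: 47814",
    "key: e2e_test/streaming/tpch_snapshot.slt, diff: +45753, arrangement runtime: 77609, no shuffle runtime: 31856",
]

# Largest relative slowdown first.
rows = sorted(
    (parse_runtime_line(line) for line in SAMPLE_LINES),
    key=lambda row: row["slowdown"],
    reverse=True,
)

for row in rows:
    # The reported diff should equal the difference of the two runtimes.
    assert row["arrangement"] - row["no_shuffle"] == row["diff"]
    print(f"{row['slowdown']:.2f}x  {row['key']}")
```

Sorting by ratio rather than by absolute diff separates files that are slow in both modes (such as `over_window/main.slt`) from files whose runtime more than doubles (such as `tpch_snapshot.slt`).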
Runtime is still too long without debug mode: 8 minutes, versus 6 minutes for a normal PR. https://buildkite.com/risingwavelabs/pull-request/builds/40961#018d590f-88cb-4868-9ad0-d57ad894bd8d
At least main-cron is not taking a long time for e2e test streaming. Seems to be an issue specific to debug mode.
This PR: 4m 34s

Main cron without arrangement backfill default: 5m
- https://buildkite.com/risingwave-test/backfill/builds/387
- https://buildkite.com/risingwave-test/backfill/builds/386
Arrangement backfill passes backfill performance tests.
- no shuffle w tomb: https://buildkite.com/risingwave-test/backfill/builds/380#018e2f1b-c128-47ef-9cf5-1e5520587b05
- no shuffle w/o tomb: https://buildkite.com/risingwave-test/backfill/builds/379#018e2ee5-adf4-46a1-8fb9-b6a92e0c2a60
- arrangement w tomb https://buildkite.com/risingwave-test/backfill/builds/387
- arrangement no tomb https://buildkite.com/risingwave-test/backfill/builds/386
| | arrangement | no shuffle | arrangement w tomb | no shuffle w tomb |
|---|---|---|---|---|
| create_watermark_mv_latency(ms) | 124186.916 | 235615.608 | 124532.255 | 213548.740 |
| create_mv_latency(ms) | 114223.373 | 241254.552 | 110062.274 | 230168.287 |
| batch_query_latency(ms) | 26236.617 | 24255.724 | 28296.241 | 15896.016 |
| mv_query_latency(ms) | 21391.008 | 22818.898 | 22080.314 | 16404.857 |
\* "tomb" refers to tombstones, which are generated when values are deleted. An old issue, #12680, shows that backfill had problems when there was a large number of tombstones.
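To make the table easier to read at a glance, the latencies can be turned into no-shuffle / arrangement ratios. This is a small Python sketch rather than part of the benchmark tooling. It assumes the four value columns are, in order, arrangement, no shuffle, arrangement w tomb, and no shuffle w tomb. The names `LATENCY_MS`, `no_shuffle_over_arrangement`, and `RATIOS` are illustrative.

```python
# Latencies in milliseconds, copied from the table above.
# Assumed column order: arrangement, no shuffle,
# arrangement w tomb, no shuffle w tomb.
LATENCY_MS = {
    "create_watermark_mv_latency": (124186.916, 235615.608, 124532.255, 213548.740),
    "create_mv_latency": (114223.373, 241254.552, 110062.274, 230168.287),
    "batch_query_latency": (26236.617, 24255.724, 28296.241, 15896.016),
    "mv_query_latency": (21391.008, 22818.898, 22080.314, 16404.857),
}


def no_shuffle_over_arrangement(arr, no_shuffle, arr_tomb, no_shuffle_tomb):
    """Return (ratio without tombstones, ratio with tombstones).

    A ratio above 1 means the arrangement run had the lower latency.
    """
    return no_shuffle / arr, no_shuffle_tomb / arr_tomb


RATIOS = {
    metric: no_shuffle_over_arrangement(*values)
    for metric, values in LATENCY_MS.items()
}

for metric, (plain, tomb) in RATIOS.items():
    print(f"{metric}: {plain:.2f}x without tombstones, {tomb:.2f}x with tombstones")
```

Under that column assumption, the two MV creation metrics come out between roughly 1.7x and 2.1x in favor of arrangement backfill, while the two query latency ratios fall between roughly 0.56x and 1.07x.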
To revert this PR:
- Set `streaming_use_arrangement_backfill` back to false by default.
- Run `./risedev dapt` to reset stream scan to use backfill rather than arrangement backfill.
Main cron today (https://buildkite.com/risingwavelabs/main-cron/builds/2104) seems to be failing as well due to TLE.
The main-cron run for this PR: https://buildkite.com/risingwavelabs/main-cron/builds/2108#
Importantly, the runtime for backfill and e2e test does not regress.
In the main cron build above, the timings are 12m26s and 15m40s respectively.
Waiting for the main-cron fixes https://github.com/risingwavelabs/risingwave/pull/15861 and https://github.com/risingwavelabs/risingwave/pull/15908 before merging this in, to make sure no regressions are caused by arrangement backfill.
Fix for the parallel in-memory tests: https://github.com/risingwavelabs/risingwave/pull/15930