risingwave chore(config): make arrangement backfill default

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Enable only after:

[x] https://github.com/risingwavelabs/risingwave/pull/14842
[x] https://github.com/risingwavelabs/risingwave/pull/14836
[ ] ~And a streaming nexmark bench for main queries.~ Nexmark bench does not contain backfill. We should run backfill bench instead.

Now mainly testing it to make sure recent changes to arrangement backfill did not trigger any regressions.

Checklist

[ ] I have written necessary rustdoc comments
[ ] I have added necessary unit tests and integration tests
[ ] I have added test labels as necessary. See details.
[ ] I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features #7934).
[ ] My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
[ ] All checks passed in ./risedev check (or alias, ./risedev c)
[ ] My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)

[ ] My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

[ ] My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

Jan 29 '24 12:01 kwannoel

Need to figure out:

Why the e2e test hits the time limit (TLE). Perhaps it's the get_row call each time we persist state in debug mode.
Why the deterministic test test_high_barrier_latency_cancel takes a long time.
Why the recovery test fails.

Jan 29 '24 14:01 kwannoel

E2e test runtime comparison (debug mode), arrangement backfill vs. no shuffle backfill

key: e2e_test/streaming/aggregate/hdr_approx_percentile.slt, diff: +1409, arrangement runtime: 10965, no shuffle runtime: 9556

key: e2e_test/streaming/./nexmark/create_views.slt.part, diff: +4248, arrangement runtime: 19953, no shuffle runtime: 15705

key: e2e_test/streaming/nexmark_snapshot.slt, diff: +5114, arrangement runtime: 22997, no shuffle runtime: 17883

key: e2e_test/streaming/nexmark_upstream.slt, diff: +5226, arrangement runtime: 23449, no shuffle runtime: 18223

key: e2e_test/streaming/over_window/main.slt, diff: +2053, arrangement runtime: 49867, no shuffle runtime: 47814

key: e2e_test/streaming/./tpch/./views/q5.slt.part, diff: +1287, arrangement runtime: 2045, no shuffle runtime: 758

key: e2e_test/streaming/./tpch/./views/q7.slt.part, diff: +1782, arrangement runtime: 2563, no shuffle runtime: 781

key: e2e_test/streaming/./tpch/./views/q8.slt.part, diff: +2189, arrangement runtime: 3556, no shuffle runtime: 1367

key: e2e_test/streaming/./tpch/./views/q9.slt.part, diff: +1991, arrangement runtime: 3312, no shuffle runtime: 1321

key: e2e_test/streaming/./tpch/./views/q10.slt.part, diff: +1858, arrangement runtime: 2875, no shuffle runtime: 1017

key: e2e_test/streaming/./tpch/./views/q11.slt.part, diff: +1844, arrangement runtime: 2828, no shuffle runtime: 984

key: e2e_test/streaming/./tpch/./views/q12.slt.part, diff: +1585, arrangement runtime: 2285, no shuffle runtime: 700

key: e2e_test/streaming/./tpch/./views/q13.slt.part, diff: +1560, arrangement runtime: 2252, no shuffle runtime: 692

key: e2e_test/streaming/./tpch/./views/q14.slt.part, diff: +1561, arrangement runtime: 2263, no shuffle runtime: 702

key: e2e_test/streaming/./tpch/./views/q15.slt.part, diff: +1864, arrangement runtime: 2614, no shuffle runtime: 750

key: e2e_test/streaming/./tpch/./views/q16.slt.part, diff: +2220, arrangement runtime: 3210, no shuffle runtime: 990

key: e2e_test/streaming/./tpch/./views/q17.slt.part, diff: +2165, arrangement runtime: 3464, no shuffle runtime: 1299

key: e2e_test/streaming/./tpch/./views/q18.slt.part, diff: +2399, arrangement runtime: 3713, no shuffle runtime: 1314

key: e2e_test/streaming/./tpch/./views/q19.slt.part, diff: +2096, arrangement runtime: 3086, no shuffle runtime: 990

key: e2e_test/streaming/./tpch/./views/q20.slt.part, diff: +3577, arrangement runtime: 5405, no shuffle runtime: 1828

key: e2e_test/streaming/./tpch/./views/q21.slt.part, diff: +3933, arrangement runtime: 5887, no shuffle runtime: 1954

key: e2e_test/streaming/./tpch/./views/q22.slt.part, diff: +2721, arrangement runtime: 4033, no shuffle runtime: 1312

key: e2e_test/streaming/./tpch/create_views.slt.part, diff: +38380, arrangement runtime: 60467, no shuffle runtime: 22087

key: e2e_test/streaming/./tpch/drop_views.slt.part, diff: +4169, arrangement runtime: 8156, no shuffle runtime: 3987

key: e2e_test/streaming/tpch_snapshot.slt, diff: +45753, arrangement runtime: 77609, no shuffle runtime: 31856

key: e2e_test/streaming/tpch_upstream.slt, diff: +44402, arrangement runtime: 77006, no shuffle runtime: 32604
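The `diff` column above is simply the arrangement runtime minus the no shuffle runtime, so a relative slowdown is often easier to compare across tests. A minimal Python sketch, assuming the lines keep exactly the format shown above; the helper names `parse_line` and `slowdown` are hypothetical and not part of any RisingWave tooling, and the time unit is whatever the test harness reported:

```python
import re

# One line of the comparison above, e.g.
# "key: <path>, diff: +1409, arrangement runtime: 10965, no shuffle runtime: 9556"
LINE_RE = re.compile(
    r"key: (?P<key>\S+), diff: (?P<diff>[+-]?\d+), "
    r"arrangement runtime: (?P<arr>\d+), no shuffle runtime: (?P<ns>\d+)"
)


def parse_line(line):
    """Return (key, diff, arrangement_runtime, no_shuffle_runtime)."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")
    return m["key"], int(m["diff"]), int(m["arr"]), int(m["ns"])


def slowdown(line):
    """Return (key, arrangement / no_shuffle) after checking the diff column."""
    key, diff, arr, ns = parse_line(line)
    if arr - ns != diff:
        raise ValueError(f"diff mismatch for {key}: {arr} - {ns} != {diff}")
    return key, arr / ns


# A few lines copied from the comparison above.
SAMPLE = [
    "key: e2e_test/streaming/aggregate/hdr_approx_percentile.slt, diff: +1409, arrangement runtime: 10965, no shuffle runtime: 9556",
    "key: e2e_test/streaming/./tpch/./views/q5.slt.part, diff: +1287, arrangement runtime: 2045, no shuffle runtime: 758",
    "key: e2e_test/streaming/tpch_snapshot.slt, diff: +45753, arrangement runtime: 77609, no shuffle runtime: 31856",
]

if __name__ == "__main__":
    for line in SAMPLE:
        key, ratio = slowdown(line)
        print(f"{key}: {ratio:.2f}x")
```

Sorting the full list by this ratio rather than by the absolute diff would highlight the short tpch view tests, where the fixed overhead dominates.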

Jan 30 '24 09:01 kwannoel

Runtime is still too long even without debug mode: 8 minutes vs. 6 minutes for a normal PR. https://buildkite.com/risingwavelabs/pull-request/builds/40961#018d590f-88cb-4868-9ad0-d57ad894bd8d

Jan 30 '24 09:01 kwannoel

At least main-cron is not taking a long time for the e2e streaming test, so this seems to be an issue specific to debug mode. This PR: 4m 34s.

Main cron without arrangement backfill as the default: 5m.

Jan 30 '24 10:01 kwannoel

https://buildkite.com/risingwave-test/backfill/builds/387
https://buildkite.com/risingwave-test/backfill/builds/386

Arrangement backfill passes backfill performance tests.

Mar 14 '24 06:03 kwannoel

no shuffle w tomb: https://buildkite.com/risingwave-test/backfill/builds/380#018e2f1b-c128-47ef-9cf5-1e5520587b05
no shuffle w/o tomb: https://buildkite.com/risingwave-test/backfill/builds/379#018e2ee5-adf4-46a1-8fb9-b6a92e0c2a60
arrangement w tomb: https://buildkite.com/risingwave-test/backfill/builds/387
arrangement w/o tomb: https://buildkite.com/risingwave-test/backfill/builds/386

| Metric | arrangement | no shuffle | arrangement w tomb | no shuffle w tomb |
| --- | --- | --- | --- | --- |
| create_watermark_mv_latency (ms) | 124186.916 | 235615.608 | 124532.255 | 213548.740 |
| create_mv_latency (ms) | 114223.373 | 241254.552 | 110062.274 | 230168.287 |
| batch_query_latency (ms) | 26236.617 | 24255.724 | 28296.241 | 15896.016 |
| mv_query_latency (ms) | 21391.008 | 22818.898 | 22080.314 | 16404.857 |

* "tomb" refers to tombstones, which are generated when values are deleted. An old issue, #12680, shows that backfill had problems when there was a large number of tombstones.
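To read the benchmark numbers at a glance, it can help to express each arrangement latency as a fraction of the matching no shuffle latency. A small Python sketch using only the values reported above; the dictionary layout and the `ratio` helper are illustrative assumptions, not part of the benchmark suite:

```python
# Latencies in milliseconds, copied from the benchmark results above.
LATENCY_MS = {
    "create_watermark_mv_latency": {
        "arrangement": 124186.916, "no shuffle": 235615.608,
        "arrangement w tomb": 124532.255, "no shuffle w tomb": 213548.740,
    },
    "create_mv_latency": {
        "arrangement": 114223.373, "no shuffle": 241254.552,
        "arrangement w tomb": 110062.274, "no shuffle w tomb": 230168.287,
    },
    "batch_query_latency": {
        "arrangement": 26236.617, "no shuffle": 24255.724,
        "arrangement w tomb": 28296.241, "no shuffle w tomb": 15896.016,
    },
    "mv_query_latency": {
        "arrangement": 21391.008, "no shuffle": 22818.898,
        "arrangement w tomb": 22080.314, "no shuffle w tomb": 16404.857,
    },
}


def ratio(metric, with_tombstones=False):
    """Arrangement latency divided by no shuffle latency (below 1.0 means faster)."""
    suffix = " w tomb" if with_tombstones else ""
    row = LATENCY_MS[metric]
    return row["arrangement" + suffix] / row["no shuffle" + suffix]


if __name__ == "__main__":
    for metric in LATENCY_MS:
        print(f"{metric}: {ratio(metric):.2f} (no tomb), {ratio(metric, True):.2f} (w tomb)")
```

On these numbers, the two MV creation latencies come out at roughly half of the no shuffle values, while the query latencies with tombstones are higher for arrangement backfill than for no shuffle.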

Mar 18 '24 03:03 kwannoel

To revert this PR:

Set streaming_use_arrangement_backfill back to false by default.
Run ./risedev dapt to regenerate the planner test output, so that stream scan uses backfill rather than arrangement backfill.

Mar 19 '24 16:03 kwannoel

Today's main cron (https://buildkite.com/risingwavelabs/main-cron/builds/2104) also seems to be failing due to TLE.

For the main-cron run for this PR (https://buildkite.com/risingwavelabs/main-cron/builds/2108#), the important point is that the runtime for the backfill and e2e tests does not regress.

In the main cron build above, the timings are 12m26s and 15m40s respectively.

Mar 21 '24 06:03 kwannoel

Waiting for the main-cron fixes https://github.com/risingwavelabs/risingwave/pull/15861 and https://github.com/risingwavelabs/risingwave/pull/15908 before merging this in, to make sure no regressions are caused by arrangement backfill.

Mar 26 '24 06:03 kwannoel

The fix for the parallel in-memory tests is here: https://github.com/risingwavelabs/risingwave/pull/15930

Mar 26 '24 18:03 kwannoel
