chore(config): make arrangement backfill default

Open kwannoel opened this issue 1 year ago • 4 comments

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Enable only after:

  • [x] https://github.com/risingwavelabs/risingwave/pull/14842
  • [x] https://github.com/risingwavelabs/risingwave/pull/14836
  • [ ] ~And a streaming nexmark bench for main queries.~ Nexmark bench does not contain backfill. We should run backfill bench instead.

Now mainly testing it to make sure recent changes to arrangement backfill did not trigger any regressions.

Checklist

  • [ ] I have written necessary rustdoc comments
  • [ ] I have added necessary unit tests and integration tests
  • [ ] I have added test labels as necessary. See details.
  • [ ] I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features #7934).
  • [ ] My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • [ ] All checks passed in ./risedev check (or alias, ./risedev c)
  • [ ] My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • [ ] My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • [ ] My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

kwannoel avatar Jan 29 '24 12:01 kwannoel

Need to figure out:

  1. Why the e2e test hits the time limit (TLE). Perhaps it's the get_row call each time we persist state in debug mode.
  2. Why the deterministic test test_high_barrier_latency_cancel takes a long time.
  3. Why the recovery test fails.

kwannoel avatar Jan 29 '24 14:01 kwannoel

E2e test Runtime comparison (debug mode)

key: e2e_test/streaming/aggregate/hdr_approx_percentile.slt, diff: +1409, arrangement runtime: 10965, no shuffle runtime: 9556

key: e2e_test/streaming/./nexmark/create_views.slt.part, diff: +4248, arrangement runtime: 19953, no shuffle runtime: 15705

key: e2e_test/streaming/nexmark_snapshot.slt, diff: +5114, arrangement runtime: 22997, no shuffle runtime: 17883

key: e2e_test/streaming/nexmark_upstream.slt, diff: +5226, arrangement runtime: 23449, no shuffle runtime: 18223

key: e2e_test/streaming/over_window/main.slt, diff: +2053, arrangement runtime: 49867, no shuffle runtime: 47814

key: e2e_test/streaming/./tpch/./views/q5.slt.part, diff: +1287, arrangement runtime: 2045, no shuffle runtime: 758

key: e2e_test/streaming/./tpch/./views/q7.slt.part, diff: +1782, arrangement runtime: 2563, no shuffle runtime: 781

key: e2e_test/streaming/./tpch/./views/q8.slt.part, diff: +2189, arrangement runtime: 3556, no shuffle runtime: 1367

key: e2e_test/streaming/./tpch/./views/q9.slt.part, diff: +1991, arrangement runtime: 3312, no shuffle runtime: 1321

key: e2e_test/streaming/./tpch/./views/q10.slt.part, diff: +1858, arrangement runtime: 2875, no shuffle runtime: 1017

key: e2e_test/streaming/./tpch/./views/q11.slt.part, diff: +1844, arrangement runtime: 2828, no shuffle runtime: 984

key: e2e_test/streaming/./tpch/./views/q12.slt.part, diff: +1585, arrangement runtime: 2285, no shuffle runtime: 700

key: e2e_test/streaming/./tpch/./views/q13.slt.part, diff: +1560, arrangement runtime: 2252, no shuffle runtime: 692

key: e2e_test/streaming/./tpch/./views/q14.slt.part, diff: +1561, arrangement runtime: 2263, no shuffle runtime: 702

key: e2e_test/streaming/./tpch/./views/q15.slt.part, diff: +1864, arrangement runtime: 2614, no shuffle runtime: 750

key: e2e_test/streaming/./tpch/./views/q16.slt.part, diff: +2220, arrangement runtime: 3210, no shuffle runtime: 990

key: e2e_test/streaming/./tpch/./views/q17.slt.part, diff: +2165, arrangement runtime: 3464, no shuffle runtime: 1299

key: e2e_test/streaming/./tpch/./views/q18.slt.part, diff: +2399, arrangement runtime: 3713, no shuffle runtime: 1314

key: e2e_test/streaming/./tpch/./views/q19.slt.part, diff: +2096, arrangement runtime: 3086, no shuffle runtime: 990

key: e2e_test/streaming/./tpch/./views/q20.slt.part, diff: +3577, arrangement runtime: 5405, no shuffle runtime: 1828

key: e2e_test/streaming/./tpch/./views/q21.slt.part, diff: +3933, arrangement runtime: 5887, no shuffle runtime: 1954

key: e2e_test/streaming/./tpch/./views/q22.slt.part, diff: +2721, arrangement runtime: 4033, no shuffle runtime: 1312

key: e2e_test/streaming/./tpch/create_views.slt.part, diff: +38380, arrangement runtime: 60467, no shuffle runtime: 22087

key: e2e_test/streaming/./tpch/drop_views.slt.part, diff: +4169, arrangement runtime: 8156, no shuffle runtime: 3987

key: e2e_test/streaming/tpch_snapshot.slt, diff: +45753, arrangement runtime: 77609, no shuffle runtime: 31856

key: e2e_test/streaming/tpch_upstream.slt, diff: +44402, arrangement runtime: 77006, no shuffle runtime: 32604
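
The comparison lines above follow a fixed shape, so they are easy to rank by relative slowdown rather than by absolute diff. The following is a minimal sketch, assuming the lines are available as plain text; the helper name `parse` and the `ratio` field are illustrative and not part of any RisingWave tooling, and the runtimes are treated as unitless since the log does not state a unit.

```python
import re

# Two sample lines copied from the comparison above; the others parse the same way.
LOG = """\
key: e2e_test/streaming/nexmark_snapshot.slt, diff: +5114, arrangement runtime: 22997, no shuffle runtime: 17883
key: e2e_test/streaming/tpch_snapshot.slt, diff: +45753, arrangement runtime: 77609, no shuffle runtime: 31856
"""

# Hypothetical pattern matching the line shape used in this comment.
LINE_RE = re.compile(
    r"key: (?P<key>\S+), diff: (?P<diff>[+-]?\d+), "
    r"arrangement runtime: (?P<arr>\d+), no shuffle runtime: (?P<ns>\d+)"
)


def parse(log):
    """Return one dict per line, sorted by slowdown ratio (largest first)."""
    rows = []
    for m in LINE_RE.finditer(log):
        arr, ns = int(m["arr"]), int(m["ns"])
        # Sanity check: the reported diff is arrangement minus no-shuffle runtime.
        assert int(m["diff"]) == arr - ns
        rows.append({"key": m["key"], "diff": arr - ns, "ratio": arr / ns})
    return sorted(rows, key=lambda r: r["ratio"], reverse=True)


rows = parse(LOG)
for r in rows:
    print(f'{r["key"]}: {r["diff"]:+d} ({r["ratio"]:.2f}x)')
```

Sorting by ratio instead of diff highlights the short tpch view tests, where the absolute difference is small but the relative slowdown is large.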

kwannoel avatar Jan 30 '24 09:01 kwannoel

Screenshot 2024-01-30 at 5 42 44 PM

The runtime is still too long without the debug: 8 minutes, versus 6 minutes for a normal PR. https://buildkite.com/risingwavelabs/pull-request/builds/40961#018d590f-88cb-4868-9ad0-d57ad894bd8d

kwannoel avatar Jan 30 '24 09:01 kwannoel

At least main-cron does not take a long time for the e2e streaming test, so this seems to be an issue specific to debug mode. This PR takes 4m 34s: Screenshot 2024-01-30 at 6 07 51 PM

Main cron without arrangement backfill default 5m: Screenshot 2024-01-30 at 6 17 25 PM

kwannoel avatar Jan 30 '24 10:01 kwannoel

Screenshot 2024-03-14 at 2 39 10 PM

  • https://buildkite.com/risingwave-test/backfill/builds/387
  • https://buildkite.com/risingwave-test/backfill/builds/386

Arrangement backfill passes backfill performance tests.

kwannoel avatar Mar 14 '24 06:03 kwannoel

  • no shuffle w tomb: https://buildkite.com/risingwave-test/backfill/builds/380#018e2f1b-c128-47ef-9cf5-1e5520587b05
  • no shuffle w/o tomb: https://buildkite.com/risingwave-test/backfill/builds/379#018e2ee5-adf4-46a1-8fb9-b6a92e0c2a60
  • arrangement w tomb: https://buildkite.com/risingwave-test/backfill/builds/387
  • arrangement no tomb: https://buildkite.com/risingwave-test/backfill/builds/386

| metric | arrangement | no shuffle | arrangement w tomb | no shuffle w tomb |
| --- | --- | --- | --- | --- |
| create_watermark_mv_latency(ms) | 124186.916 | 235615.608 | 124532.255 | 213548.740 |
| create_mv_latency(ms) | 114223.373 | 241254.552 | 110062.274 | 230168.287 |
| batch_query_latency(ms) | 26236.617 | 24255.724 | 28296.241 | 15896.016 |
| mv_query_latency(ms) | 21391.008 | 22818.898 | 22080.314 | 16404.857 |

* "tomb" refers to tombstones, which are generated when values are deleted. An old issue, #12680, shows that backfill had problems when there was a large number of tombstones.
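
To make the latency table above easier to read, the numbers can be reduced to arrangement / no-shuffle ratios, where a value below 1 means arrangement backfill was faster. This is a minimal sketch: the values are copied from the table, while the dictionary layout and the `ratios` helper are illustrative assumptions, not part of the benchmark tooling.

```python
# Latencies in ms, copied from the table above.
# Each tuple is (arrangement, no shuffle).
LATENCY_MS = {
    "without tombstones": {
        "create_watermark_mv_latency": (124186.916, 235615.608),
        "create_mv_latency": (114223.373, 241254.552),
        "batch_query_latency": (26236.617, 24255.724),
        "mv_query_latency": (21391.008, 22818.898),
    },
    "with tombstones": {
        "create_watermark_mv_latency": (124532.255, 213548.740),
        "create_mv_latency": (110062.274, 230168.287),
        "batch_query_latency": (28296.241, 15896.016),
        "mv_query_latency": (22080.314, 16404.857),
    },
}


def ratios(table):
    """Map each scenario and metric to arrangement latency / no-shuffle latency."""
    return {
        scenario: {metric: arr / ns for metric, (arr, ns) in metrics.items()}
        for scenario, metrics in table.items()
    }


RATIOS = ratios(LATENCY_MS)
for scenario, metrics in RATIOS.items():
    for metric, ratio in metrics.items():
        print(f"{scenario:>18}  {metric:<28} {ratio:.2f}")
```

On these numbers, the MV creation latencies with arrangement backfill are roughly half of the no-shuffle ones, while the batch and MV query latencies in the tombstone runs come out higher for arrangement backfill.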

kwannoel avatar Mar 18 '24 03:03 kwannoel

To revert this PR:

  1. Set the default of streaming_use_arrangement_backfill back to false.
  2. Run ./risedev dapt to reset stream scan to use backfill rather than arrangement backfill.

kwannoel avatar Mar 19 '24 16:03 kwannoel

Main cron today https://buildkite.com/risingwavelabs/main-cron/builds/2104 seems to be failing as well due to TLE: Screenshot 2024-03-21 at 2 52 11 PM

For the main-cron build for this PR (https://buildkite.com/risingwavelabs/main-cron/builds/2108#), the important point is that the runtime for the backfill and e2e tests does not regress: Screenshot 2024-03-21 at 2 52 53 PM

In the main cron build above, the timings are 12m26s and 15m40s, respectively.

kwannoel avatar Mar 21 '24 06:03 kwannoel

Waiting for the main-cron fixes https://github.com/risingwavelabs/risingwave/pull/15861 and https://github.com/risingwavelabs/risingwave/pull/15908 before merging this in, to make sure no regressions are caused by arrangement backfill.

kwannoel avatar Mar 26 '24 06:03 kwannoel

The parallel in-memory tests are fixed here: https://github.com/risingwavelabs/risingwave/pull/15930

kwannoel avatar Mar 26 '24 18:03 kwannoel