
Adaptable rate limit for backfill

[Open] kwannoel opened this issue 11 months ago · 2 comments

From @chenzl25:

Assume the cluster already has many materialized views, and users rely on their freshness for batch queries. If a user creates a new materialized view at this point, our default backfilling strategy will increase barrier latency, and the freshness of the data users query will degrade. Today the strategy we offer is to limit the impact of the backfill by tuning the rate limit, but that knob is rather low-level. I wonder whether we can expose a higher-level policy for users to choose: backfill that affects barrier latency as little as possible. Technically, we could achieve this through adaptive rate-limit adjustment (similar to TCP flow control), or avoid the impact of the materialized view being created on existing materialized views through partial checkpoint.

We can do something like slow start (start with rate_limit=1 on the snapshot-read side) and adapt it based on barrier collection metrics.
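
As a rough illustration of this slow-start idea (not actual RisingWave code; the type names, fields, and thresholds below are hypothetical), an AIMD-style controller could start at `rate_limit = 1` and grow or cut the limit based on the observed barrier latency:

```rust
/// Hypothetical AIMD-style controller for the backfill snapshot-read rate limit.
/// Names and thresholds are illustrative, not taken from the RisingWave codebase.
pub struct AdaptiveRateLimit {
    /// Current rows-per-second limit applied to snapshot reads.
    rate_limit: u64,
    /// Upper bound so the limit cannot grow without bound.
    max_rate_limit: u64,
    /// Barrier latency above this threshold is treated as congestion.
    target_latency: std::time::Duration,
}

impl AdaptiveRateLimit {
    pub fn new(max_rate_limit: u64, target_latency: std::time::Duration) -> Self {
        // Slow start: begin with the smallest possible rate limit.
        Self { rate_limit: 1, max_rate_limit, target_latency }
    }

    /// Adjust the limit after each barrier, based on the observed collection latency.
    pub fn on_barrier_collected(&mut self, barrier_latency: std::time::Duration) -> u64 {
        if barrier_latency <= self.target_latency {
            // Latency is healthy: grow multiplicatively, like TCP slow start.
            self.rate_limit = (self.rate_limit * 2).min(self.max_rate_limit);
        } else {
            // Latency regressed: back off multiplicatively, but never below 1.
            self.rate_limit = (self.rate_limit / 2).max(1);
        }
        self.rate_limit
    }
}
```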

An important caveat: if the pressure on the MV comes from the upstream side, only the log store or scaling out compute resources can help. This feature only handles the case where the pressure comes from the snapshot read.

The meta node can control the rate limit for the stream job.
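
If the meta node drives the adjustment, one possible shape (again hedged: `notify_rate_limit` and the latency channel are placeholders for whatever the rate limit refactor below exposes, and the controller is the sketch above) is a loop that feeds barrier collection latency into the controller and pushes the new limit to the job's backfill executors:

```rust
// Hypothetical meta-side control loop; assumes the `AdaptiveRateLimit` sketch above
// and a tokio channel carrying per-barrier collection latencies for the stream job.
async fn adapt_backfill_rate_limit(
    job_id: u32,
    mut barrier_latencies: tokio::sync::mpsc::Receiver<std::time::Duration>,
) {
    let mut controller =
        AdaptiveRateLimit::new(100_000, std::time::Duration::from_secs(1));
    while let Some(latency) = barrier_latencies.recv().await {
        let new_limit = controller.on_barrier_collected(latency);
        // Push the updated limit to the backfill executors of this stream job.
        notify_rate_limit(job_id, new_limit).await;
    }
}

/// Placeholder for the RPC / barrier mutation that would carry the new limit
/// to compute nodes; assumed to be provided by the rate limit refactor.
async fn notify_rate_limit(_job_id: u32, _rate_limit: u64) {
    unimplemented!()
}
```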

Prerequisite: https://github.com/risingwavelabs/risingwave/issues/16113

kwannoel · Apr 03 '24

Waiting for the rate limit refactor to happen first.

kwannoel · Apr 08 '24

  • [ ] Implement the frontend part
  • [ ] Implement the adaptive rate limit within the backfill executor
  • [ ] Add tests for different scenarios
  • [ ] Add a metric for the currently adapted rate limit
  • [ ] Add a benchmark comparing no rate limit vs. adaptive rate limit; the time taken should be roughly the same.

kwannoel · May 11 '24