risingwave icon indicating copy to clipboard operation
risingwave copied to clipboard

Discussion: a case for enabling non-simple local aggregation

Open lmatz opened this issue 2 years ago • 1 comments

Local aggregation was discussed. IIRC, it sends deltas instead of updates(absolute values) to the global aggregation and thus becomes stateless.

One problem is that we don't really know how many distinct group-by keys are in a stream chunk and if the cardinality is high, non-simple local aggregation makes 0 sense. Therefore, currently, we only allow local_simple_agg as in the case of simple aggregation, we are certain that the output of local aggregation is a single row as there is no group by key. The benefit is huge.

One special case for non-simple local aggregation is that:

  1. the group key is derived from the window, i.e. window_start from tumble_window.
  2. window_start are mostly the same within a period of time because of the nature of the timestamp on the input data stream.
  3. window with XX minute further makes the phenomenon even more obvious/worse.

If this is the case, only one actor of the global aggregation is effectively working at any given time. This is the reality we have to accept. But non-simple local aggregation becomes very effective as the cardinality of the group-by key is low.

The default setting may remain status quo. May give an option to do non-simple local aggregation when the user is certain it can help.

give an option for the users to incorporate their insights, together with the help of the optimizer, is a long-term issue.

lmatz avatar Sep 22 '22 05:09 lmatz

https://github.com/risingwavelabs/risingwave/issues/3255

st1page avatar Sep 22 '22 05:09 st1page

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

github-actions[bot] avatar Nov 22 '22 02:11 github-actions[bot]

This optimization is certainly beneficial for some cases, but before really implementing it,

I am wondering: how do we let users enable/disable this optimization when users have the domain knowledge and want to make the decision?

Session variable? (easy but coarse, may have corner cases) Hints? (require some big effort?)

lmatz avatar Nov 22 '22 03:11 lmatz

I think this is a quite common issue for our optimizer. e.g.

  • Whether to use two phase aggregation or not?
  • Whether to use a different join ordering? (left deep tree or bushy tree or user sql specified order)
  • Whether to use delta join or not? ...

With https://github.com/risingwavelabs/risingwave/issues/7014 , we can first provide a session variable for each optimization that is hard to decide without sufficient knowledge. The optimization session variable can have force/enable/disable options and it is enable by default. In this way our user or our console can try to make a combination of all these optimization session variables to create a desired plan. I think using the combination of the global configs like session variables to control the plan is effective in most cases. I remembered the 2021 SIGMOD best Paper, Bao: Making Learned Query Optimization Practical also uses this strategy.

chenzl25 avatar Dec 28 '22 10:12 chenzl25

Strong +1 for optimizer hint / session variable. It's very common to try different execution plans or manually control the plans in practice.

how do we let users enable/disable this optimization when users have the domain knowledge and want to make the decision?

  • Session variable? (easy but coarse, may have corner cases)
  • Hints? (require some big effort?)

PG only provides session variables: https://www.postgresql.org/docs/current/runtime-config-query.html . Personally, I don't like it because it only allows disabling a particular function, instead of telling PG to use a specified access path.

There is a PG plugin pg_hint_plan providing a syntax to control a certain query. I would vote for it between optimizer hint and session variable.

fuyufjh avatar Dec 29 '22 08:12 fuyufjh

Seems pg_hint_plan can cover some cases that can not be (easily?) covered by the session variables, suppose we have two aggregations in the plan and we want the first one to be two-phase aggregation while the second one to be shuffle-aggregation.

lmatz avatar Dec 29 '22 08:12 lmatz

Agree, both the optimizer hint and session variable have their useful scenarios. The optimizer hint provides us with fine-grained control over a query that is useful for developers as well as users looking to fine-tune their plans. A coarse-grained control like session variables might be more practical if we want to also let our console choose different physical plans with predefined optimization combinations for our users, as it is hard to let the console understand SQL and fine-tune the plan.

chenzl25 avatar Dec 29 '22 09:12 chenzl25

See https://github.com/risingwavelabs/risingwave/issues/7491

fuyufjh avatar Jan 30 '23 03:01 fuyufjh