risingwave
risingwave copied to clipboard
Discussion: a case for enabling non-simple local aggregation
Local aggregation was discussed. IIRC, it sends deltas instead of updates(absolute values) to the global aggregation and thus becomes stateless.
One problem is that we don't really know how many distinct group-by keys are in a stream chunk and if the cardinality is high, non-simple local aggregation makes 0 sense. Therefore, currently, we only allow local_simple_agg
as in the case of simple aggregation, we are certain that the output of local aggregation is a single row as there is no group by key. The benefit is huge.
One special case for non-simple local aggregation is that:
- the group key is derived from the window, i.e.
window_start
fromtumble_window
. -
window_start
are mostly the same within a period of time because of the nature of the timestamp on the input data stream. -
window
withXX minute
further makes the phenomenon even more obvious/worse.
If this is the case, only one actor of the global aggregation is effectively working at any given time. This is the reality we have to accept. But non-simple local aggregation becomes very effective as the cardinality of the group-by key is low.
The default setting may remain status quo. May give an option to do non-simple local aggregation when the user is certain it can help.
give an option
for the users to incorporate their insights, together with the help of the optimizer, is a long-term issue.
https://github.com/risingwavelabs/risingwave/issues/3255
This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.
This optimization is certainly beneficial for some cases, but before really implementing it,
I am wondering: how do we let users enable/disable this optimization when users have the domain knowledge and want to make the decision?
Session variable? (easy but coarse, may have corner cases) Hints? (require some big effort?)
I think this is a quite common issue for our optimizer. e.g.
- Whether to use two phase aggregation or not?
- Whether to use a different join ordering? (left deep tree or bushy tree or user sql specified order)
- Whether to use delta join or not? ...
With https://github.com/risingwavelabs/risingwave/issues/7014 , we can first provide a session variable for each optimization that is hard to decide without sufficient knowledge. The optimization session variable can have force/enable/disable options and it is enable by default. In this way our user or our console can try to make a combination of all these optimization session variables to create a desired plan. I think using the combination of the global configs like session variables to control the plan is effective in most cases. I remembered the 2021 SIGMOD best Paper, Bao: Making Learned Query Optimization Practical also uses this strategy.
Strong +1 for optimizer hint / session variable. It's very common to try different execution plans or manually control the plans in practice.
how do we let users enable/disable this optimization when users have the domain knowledge and want to make the decision?
- Session variable? (easy but coarse, may have corner cases)
- Hints? (require some big effort?)
PG only provides session variables: https://www.postgresql.org/docs/current/runtime-config-query.html . Personally, I don't like it because it only allows disabling a particular function, instead of telling PG to use a specified access path.
There is a PG plugin pg_hint_plan providing a syntax to control a certain query. I would vote for it between optimizer hint and session variable.
Seems pg_hint_plan can cover some cases that can not be (easily?) covered by the session variables, suppose we have two aggregations in the plan and we want the first one to be two-phase aggregation while the second one to be shuffle-aggregation.
Agree, both the optimizer hint and session variable have their useful scenarios. The optimizer hint provides us with fine-grained control over a query that is useful for developers as well as users looking to fine-tune their plans. A coarse-grained control like session variables might be more practical if we want to also let our console choose different physical plans with predefined optimization combinations for our users, as it is hard to let the console understand SQL and fine-tune the plan.
See https://github.com/risingwavelabs/risingwave/issues/7491