risingwave icon indicating copy to clipboard operation
risingwave copied to clipboard

feat(optimizer): support basic expression simplify rule

Open xzhseh opened this issue 1 year ago • 3 comments

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

related #14133.

Checklist

  • [x] I have written necessary rustdoc comments
  • [ ] I have added necessary unit tests and integration tests
  • [ ] I have added test labels as necessary. See details.
  • [ ] I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features #7934).
  • [ ] My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • [x] All checks passed in ./risedev check (or alias, ./risedev c)
  • [ ] My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • [ ] My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • [ ] My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

xzhseh avatar Feb 26 '24 19:02 xzhseh

P.S. It's strange that only in StreamFilter (presumably in LogicalShare) will these kind of predicate exist, so the current implementation is essentially checking the input in LogicalShare, I have no idea why this is the case at present.

xzhseh avatar Feb 27 '24 02:02 xzhseh

We have had some expressions simplify logic here and it could be called every time a new Condition is constructed. https://github.com/risingwavelabs/risingwave/blob/d9228fcb4681206813ea7ea943991d33c69a1f5d/src/frontend/src/utils/condition.rs#L845

I am not sure which one is better to organize the logic, implement them as a optimization rule or implement in Conditon.. but maybe we should maintain them in the same place 🤔

st1page avatar Feb 27 '24 05:02 st1page

We have had some expressions simplify logic here and it could be called every time a new Condition is constructed.

https://github.com/risingwavelabs/risingwave/blob/d9228fcb4681206813ea7ea943991d33c69a1f5d/src/frontend/src/utils/condition.rs#L845

I am not sure which one is better to organize the logic, implement them as a optimization rule or implement in Conditon.. but maybe we should maintain them in the same place 🤔

Thanks for pointing this out! I haven't noticed this before and I'll take a look tomorrow. 😄

xzhseh avatar Feb 27 '24 05:02 xzhseh

The essential problem for expression simplification in general at present, is that we can't ensure whether to transform the underlying expression to if Strong::is_null returns false.

As an example, for expression like Not(c1 > 1) or (c1 > 1), we could easily transform this to IsNotNull(c1) though Strong::is_null returns false.

As an general extension, for expression like Not(c1 > 1 and c2 > 2 and ... and cn > n) or (<same expr>), we could also transform this to IsNotNull(c1) and IsNotNull(c2) and ... and IsNotNull(cn).

But what if the underlying expression is like, e.g., Not((c1 > 1 and c2 > 2 ...) or True) or (<same expr>). In this case, we can't simply transform to the above form, since null or true would eventually evaluate to true, even if any column is null, and doing so would lead to semantic violation.

I don't quite sure if constant folding will help in such case, and there may exist many other likewise cases especially when query becomes complicated.

The ultimate goal is to [preserve the semantic] / [ensure the correctness] when simplifying the expression, cc @chenzl25 @st1page @xiangjinwu.

Any suggestion would be appreciated!

xzhseh avatar Mar 03 '24 17:03 xzhseh

But what if the underlying expression is like, e.g., Not((c1 > 1 and c2 > 2 ...) or True) or (<same expr>). In this case, we can't simply transform to the above form, since null or true would eventually evaluate to true, even if any column is null, and doing so would lead to semantic violation.

I don't quite sure if constant folding will help in such case, and there may exist many other likewise cases especially when query becomes complicated.

I've thought about it carefully and found that there are indeed many Croner cases here. 🥵

select * from nt;
 c1 | c2
----+----
    |
    |  1
 10 |  1
 10 |

select * from nt where  Not(c1 > 1 and c2 > 1) or (c1 > 1 and c2 > 1);
 c1 | c2
----+----
    |  1
 10 |  1

select * from nt where  c1 is not null and c2 is not null;
 c1 | c2
----+----
 10 |  1

chenzl25 avatar Mar 04 '24 03:03 chenzl25

Unfortunately according to the second case, my previous imagination is actually wrong, i.e.,

As an general extension, for expression like Not(c1 > 1 and c2 > 2 and ... and cn > n) or (), we could also transform this to IsNotNull(c1) and IsNotNull(c2) and ... and IsNotNull(cn).

This happens is because, Not(c1 > 1 and c2 > 1) will evaluate to True when c2 = 1; c1 = null, and the overall expression will then be True. (during runtime)

Since it's impossible to make any prediction of the column value during optimizing phase, this kind of optimization seems impossible. 😕

Should we then change this simplification only to constant or just the cases involved a single column?

xzhseh avatar Mar 04 '24 04:03 xzhseh

Unfortunately according to the second case, my previous imagination is actually wrong, i.e.,

As an general extension, for expression like Not(c1 > 1 and c2 > 2 and ... and cn > n) or (), we could also transform this to IsNotNull(c1) and IsNotNull(c2) and ... and IsNotNull(cn).

This happens is because, Not(c1 > 1 and c2 > 1) will evaluate to True when c2 = 1; c1 = null, and the overall expression will then be True. (during runtime)

Since it's impossible to make any prediction of the column value during optimizing phase, this kind of optimization seems impossible. 😕

Should we then change this simplification only to constant or just the cases involved a single column?

Yes, I think that is why PG doesn't simplify these expressions. We can just simplify the cases involved a single column.

chenzl25 avatar Mar 04 '24 05:03 chenzl25

Now we only consider e that contains at most one column as the optimizable pattern, in Not(e) [op] e, if the column exists, it will be converted to IsNotNull(col) under specific context. Two special patterns will be checked and try elimating first, e.g., true or (...) => true, false and (...) => false. cc @chenzl25, @st1page.

xzhseh avatar Mar 13 '24 20:03 xzhseh