RFC: lambda expression
use lambda expression to enhance array's complex processing
I find that we can unnest this kind of subquery in some way. Here is a PR on how to transpose Apply with ProjectSet. https://github.com/risingwavelabs/risingwave/pull/11390. If we support the following two optimizer rules later, we can resolve this issue.
- transposing Apply with TableFunction which I think it can be reduced into ProjectSet + Value in the future.
- reducing trivial nested-loop join generated by subquery unnesting.
I find that we can unnest this kind of subquery in some way. Here is a PR on how to transpose Apply with ProjectSet. https://github.com/risingwavelabs/risingwave/pull/11390. If we support the following two optimizer rules later, we can resolve this issue.
LGTM, but still not the best because it introduce unnessary Join operator 🤔
I find that we can unnest this kind of subquery in some way. Here is a PR on how to transpose Apply with ProjectSet. risingwavelabs/risingwave#11390. If we support the following two optimizer rules later, we can resolve this issue.
LGTM, but still not the best because it introduce unnessary Join operator 🤔
True. It could be a temporal solution. From a performance perspective, this RFC could do better.
The latest progress:
TL;DR
- We use
|x| x * 2syntax for lambda function MVP. - The
array_reduceis very heavy in our system, so we implement the utility array functionsarray_{sum|min|max}first.
The lambda syntax
In this RFC, we proposed using the arrow symbol “->” as the syntax for lambda expressions, such as x -> x * 2, since it's widely supported by different systems. However, this symbol conflicts with JsonAccess and creates significant ambiguity.
A bad case: how should f(ARRAY[1,2,3], x -> 'y' or z) be parsed? (x -> 'y') or z or x -> ('y' or z)? The AST will be much different after the parser phase.
We also proposed the => symbol, which is used by JavaScript and also widely accepted by programmers. Unfortunately, it's still conflict with a pg syntax, Named notation.
-- Lambda function
SELECT f(ARRAY[1,2,3], x => x * 2);
-- Named notation
SELECT abs(a => -1)
-- ???
SELECT f(apply => (x => x * 2), array => ARRAY[1,2,3]);
There are definitely many compiler techniques that can solve the problem of ambiguity in the above syntax. After all, it is much simpler compared to C++ syntax. However, as an experimental feature, we do not want to introduce very complex refactors into the parser for it. We have decided to look for a syntax that is not ambiguous and relatively simple to support this feature. The final choice is a Rust-like syntax |x| x * 2.
In the current pg 15, there is no built-in operator that allows the prefix | operator. It may conflict with custom operators, but currently we do not support the feature. Even if we support it, it 's still acceptable to forbid “|”” as a unary operator.
Of course, another important reason is that we like Rust.
Choosing the Rust-like syntax in the MVP version does not mean we will not change, but because the cost of implementing it is very low, even if we eventually choose another syntax, we can simply deprecate the rust-like one and maintain compatibility with it forever without much effort.
If the PostgreSQL eventually support the lambda function using another syntax, or our users really like the -> symbol such as duckdb or prestodb, we can support that (with a huge refactor in the parser module).
The reduce performance issue
The RFC also proposed a reduce function to create an aggregate function or a scalar function which aggregates an array. Unfortunately, the implementation is heavy since we can't pass a column-based chunk to the lambda function.
-- Calculate the sum of an array
select array_reduce(ARRAY[1,2,3], (x, y) => x + y);
The lambda arg will eventually be converted to a BoxedExpression in the backend. However, we cannot invoke the eval(DataChunk) method and can only choose the inefficient eval_row(OwnedRow) method instead. The accumulation logic is very unfriendly to chunk-based evaluation.
There many overheads when evaluating one row.
- The chained dynamic dispatch
Box<dyn Expression> - Box and unbox of
OwnedRowandDatum.
The overhead is much heavy, and the only solution is to support query compilation for expression, which is equivalent to fully rewrite the expression framework.
As a conclusion, even if we supported array_reduce, we still need to implement the array_{sum|min|max|etc} to optimize the performance. So we decide to implement the utility functions first instead of introducing array_reduce.
The bridge between aggregate functions and array scalar functions
The logic are almost identical for the scalar function array_sum and the aggregate function sum. Should we call the sum batch aggregator directly when implementing array_sum?
My personal opinion is to first try to implement array_sum manually, observe the commonalities, and then draw conclusions.
It is possible to introduce another macro to generate array scalar functions from aggregate definitions. It may look like:
#[aggregate("sum(int64) -> int64")] // original aggregate
#[array_function("array_sum(int64[]) -> int64")] // the new macro
fn sum(state: i64, input: i64) -> i64 {
state + input
}
This approach requires no extra code for each function, instead it puts all complexity into the macro.
By contrast, manually implement each array_* function is a bit tedious but trivial:
#[function("array_sum(int64[]) -> int64")]
fn array_sum(array: ListRef<'_>) -> i64 {
array.as_int64().iter().sum()
}
So I would also prefer manual implementing each array function in the first step.
Just ran into this: https://www.postgresql.org/docs/current/intarray.html