rfcs RFC: lambda expression

use lambda expression to enhance array's complex processing

Aug 02 '23 03:08 st1page

I find that we can unnest this kind of subquery in some way. Here is a PR on how to transpose Apply with ProjectSet. https://github.com/risingwavelabs/risingwave/pull/11390. If we support the following two optimizer rules later, we can resolve this issue.

transposing Apply with TableFunction which I think it can be reduced into ProjectSet + Value in the future.
reducing trivial nested-loop join generated by subquery unnesting.

Aug 02 '23 10:08 chenzl25

I find that we can unnest this kind of subquery in some way. Here is a PR on how to transpose Apply with ProjectSet. https://github.com/risingwavelabs/risingwave/pull/11390. If we support the following two optimizer rules later, we can resolve this issue.

LGTM, but still not the best because it introduce unnessary Join operator 🤔

Aug 03 '23 05:08 st1page

I find that we can unnest this kind of subquery in some way. Here is a PR on how to transpose Apply with ProjectSet. risingwavelabs/risingwave#11390. If we support the following two optimizer rules later, we can resolve this issue.

LGTM, but still not the best because it introduce unnessary Join operator 🤔

True. It could be a temporal solution. From a performance perspective, this RFC could do better.

Aug 03 '23 05:08 chenzl25

The latest progress:

TL;DR

We use |x| x * 2 syntax for lambda function MVP.
The array_reduce is very heavy in our system, so we implement the utility array functions array_{sum|min|max} first.

The lambda syntax

In this RFC, we proposed using the arrow symbol “->” as the syntax for lambda expressions, such as x -> x * 2, since it's widely supported by different systems. However, this symbol conflicts with JsonAccess and creates significant ambiguity.

A bad case: how should f(ARRAY[1,2,3], x -> 'y' or z) be parsed? (x -> 'y') or z or x -> ('y' or z)? The AST will be much different after the parser phase.

We also proposed the => symbol, which is used by JavaScript and also widely accepted by programmers. Unfortunately, it's still conflict with a pg syntax, Named notation.

-- Lambda function
SELECT f(ARRAY[1,2,3], x => x * 2);
-- Named notation
SELECT abs(a => -1)
-- ???
SELECT f(apply => (x => x * 2), array => ARRAY[1,2,3]);

There are definitely many compiler techniques that can solve the problem of ambiguity in the above syntax. After all, it is much simpler compared to C++ syntax. However, as an experimental feature, we do not want to introduce very complex refactors into the parser for it. We have decided to look for a syntax that is not ambiguous and relatively simple to support this feature. The final choice is a Rust-like syntax |x| x * 2.

In the current pg 15, there is no built-in operator that allows the prefix | operator. It may conflict with custom operators, but currently we do not support the feature. Even if we support it, it 's still acceptable to forbid “|”” as a unary operator.

Of course, another important reason is that we like Rust.

Choosing the Rust-like syntax in the MVP version does not mean we will not change, but because the cost of implementing it is very low, even if we eventually choose another syntax, we can simply deprecate the rust-like one and maintain compatibility with it forever without much effort.

If the PostgreSQL eventually support the lambda function using another syntax, or our users really like the -> symbol such as duckdb or prestodb, we can support that (with a huge refactor in the parser module).

The `reduce` performance issue

The RFC also proposed a reduce function to create an aggregate function or a scalar function which aggregates an array. Unfortunately, the implementation is heavy since we can't pass a column-based chunk to the lambda function.

-- Calculate the sum of an array
select array_reduce(ARRAY[1,2,3], (x, y) => x + y);

The lambda arg will eventually be converted to a BoxedExpression in the backend. However, we cannot invoke the eval(DataChunk) method and can only choose the inefficient eval_row(OwnedRow) method instead. The accumulation logic is very unfriendly to chunk-based evaluation.

There many overheads when evaluating one row.

The chained dynamic dispatch Box<dyn Expression>
Box and unbox of OwnedRow and Datum.

The overhead is much heavy, and the only solution is to support query compilation for expression, which is equivalent to fully rewrite the expression framework.

As a conclusion, even if we supported array_reduce, we still need to implement the array_{sum|min|max|etc} to optimize the performance. So we decide to implement the utility functions first instead of introducing array_reduce.

The bridge between aggregate functions and array scalar functions

The logic are almost identical for the scalar function array_sum and the aggregate function sum. Should we call the sum batch aggregator directly when implementing array_sum?

My personal opinion is to first try to implement array_sum manually, observe the commonalities, and then draw conclusions.

Sep 01 '23 03:09 TennyZhuang

It is possible to introduce another macro to generate array scalar functions from aggregate definitions. It may look like:

#[aggregate("sum(int64) -> int64")]                // original aggregate
#[array_function("array_sum(int64[]) -> int64")]   // the new macro
fn sum(state: i64, input: i64) -> i64 {
    state + input
}

This approach requires no extra code for each function, instead it puts all complexity into the macro.

By contrast, manually implement each array_* function is a bit tedious but trivial:

#[function("array_sum(int64[]) -> int64")]
fn array_sum(array: ListRef<'_>) -> i64 {
    array.as_int64().iter().sum()
}

So I would also prefer manual implementing each array function in the first step.

Sep 01 '23 04:09 wangrunji0408

Just ran into this: https://www.postgresql.org/docs/current/intarray.html

Jan 31 '24 06:01 xiangjinwu

RFC: lambda expression

The lambda syntax

The reduce performance issue

The bridge between aggregate functions and array scalar functions

The `reduce` performance issue