risingwave icon indicating copy to clipboard operation
risingwave copied to clipboard

Support `DISTINCT ON` for window deduplication

Open TennyZhuang opened this issue 3 years ago • 0 comments

Is your feature request related to a problem? Please describe.

Window deduplication is a common demand in streaming processing. I found that PostgreSQL supports DISTINCT ON clause, which can meet the requirement.

Additional context

Currently, we support rewriting the SELECT DISTINCT to aggregation in frontend, and it's same for DISTINCT ON, except that we may need to select more columns in the aggregation.

We should rewrite the following query:

SELECT DISTINCT ON(v1), v3, v4 FROM t1;

to:

SELECT v1, v2, v3 GROUP BY v1, v2;

The latter query is invalid in pg:

ERROR:  column "t.v2" must appear in the GROUP BY clause or be used in an aggregate function

It's because v2 and v3 in the resultset is unpredictable, but at least, they should appear in one row.

We can add a special aggregate function for this, e.g. first_value.

SELECT v1, first_value(v2), first_value(v3) GROUP BY v1, v2;

The trivial first_value implementation can ensure that v2, v3 belong to one row, except that we introduce some special designs, e.g. split the columns in aggregate operator and concat them later.

If there is an ORDER BY clause match the selection list, the result should be predictable. For the case, we can rewrite the query to grouping top-n, but we can forbid the clause first.

TennyZhuang avatar Aug 12 '22 08:08 TennyZhuang