zed icon indicating copy to clipboard operation
zed copied to clipboard

Window functions

Open philrz opened this issue 9 months ago • 1 comments

SuperDB is not expected to have support for Window Functions (as they exist in SQL) in its first GA release, but the functionality will be added eventually.

Existing issue #5308 captures some use cases from the community that may have benefitted from support for window functions.

philrz avatar May 22 '25 21:05 philrz

Here's another example that may benefit from the availability of window functions. There's an existing community zync user that has been using pipe-based queries in the language since before SQL support was introduced. They've been using an idiom that looks something like:

...
  | sort ordering_key
  | collect(this) by partition_key
  | values {partition_key, first_value:collect[1].value_field}

They ultimately do a left join on partition_key of the original data vs. the output of the above to add the first_value to each record from the original data.

Here's a simplified example of this idiom in use using generic data. Given input in sales.csv:

region,salesperson,sale_date,amount
North,Alice,2024-01-01,1000
North,Bob,2024-01-02,1200
North,Alice,2024-01-03,1100
South,Charlie,2024-01-01,1500
South,Diana,2024-01-02,1300
South,Charlie,2024-01-04,1400

We can add an additional field that shows the salesperson that made the first sale of the month in each region by using the idiom:

$ super -version
Version: 9fcbd222b

$ super -c "
from 'sales.csv'
| left join (
  from 'sales.csv'
  | sort sale_date
  | collect(this) by region
  | values {region, first_salesperson_in_region:collect[1].salesperson}
) on left.region=right.region
| values {...left, first_salesperson_in_region:right.first_salesperson_in_region}"

{region:"North",salesperson:"Alice",sale_date:"2024-01-01",amount:1000.,first_salesperson_in_region:"Alice"}
{region:"North",salesperson:"Bob",sale_date:"2024-01-02",amount:1200.,first_salesperson_in_region:"Alice"}
{region:"North",salesperson:"Alice",sale_date:"2024-01-03",amount:1100.,first_salesperson_in_region:"Alice"}
{region:"South",salesperson:"Charlie",sale_date:"2024-01-01",amount:1500.,first_salesperson_in_region:"Charlie"}
{region:"South",salesperson:"Diana",sale_date:"2024-01-02",amount:1300.,first_salesperson_in_region:"Charlie"}
{region:"South",salesperson:"Charlie",sale_date:"2024-01-04",amount:1400.,first_salesperson_in_region:"Charlie"}

This is functionally correct, but the user's real world data happens to have many large values, so we've observed that the collect phase in particular is quite resource intensive.

There may be multiple ways to improve upon this, but one available in the SQL standard is the FIRST_VALUE window function, e.g.,

$ duckdb --version
v1.3.0 71c5c07cdd

$ duckdb -c "
SELECT
    region,
    salesperson,
    sale_date,
    amount,
    FIRST_VALUE(salesperson) OVER (
        PARTITION BY region
        ORDER BY sale_date
    ) AS first_salesperson_in_region
FROM 'sales.csv'
ORDER BY region, sale_date;"
┌─────────┬─────────────┬────────────┬────────┬─────────────────────────────┐
│ region  │ salesperson │ sale_date  │ amount │ first_salesperson_in_region │
│ varchar │   varchar   │    date    │ int64  │           varchar           │
├─────────┼─────────────┼────────────┼────────┼─────────────────────────────┤
│ North   │ Alice       │ 2024-01-01 │   1000 │ Alice                       │
│ North   │ Bob         │ 2024-01-02 │   1200 │ Alice                       │
│ North   │ Alice       │ 2024-01-03 │   1100 │ Alice                       │
│ South   │ Charlie     │ 2024-01-01 │   1500 │ Charlie                     │
│ South   │ Diana       │ 2024-01-02 │   1300 │ Charlie                     │
│ South   │ Charlie     │ 2024-01-04 │   1400 │ Charlie                     │
└─────────┴─────────────┴────────────┴────────┴─────────────────────────────┘

Bottom line: Once window functions are available in SuperSQL, a straightforward implementation of a window function like FIRST_VALUE may be more performant than the idiom the user is currently using, and users may also appreciate the availability of a standard SQL window function as well.

philrz avatar May 22 '25 22:05 philrz