prql icon indicating copy to clipboard operation
prql copied to clipboard

Should window default to `expanding` with a `sort`?

Open max-sixty opened this issue 3 years ago • 23 comments

Currently we default to a window clause of including the whole window — aka rows:..

SQL's default window definition changes depending on whether there's an ORDER BY, between:

  • expanding — aka rows:..0 or expanding:true when an ORDER BY exists
  • or the whole window — aka rows:.. when no ORDER BY exists

So we force that to be explicit. I'm not that confident we're doing the right thing here; the SQL default might be surprising, but you literally never want the whole window when supplying ORDER BY; or you wouldn't supply the ORDER BY.

For context, this came up in a Discord discussion around this example:

with raw as (select 
  timechunk, 
  keyword, 
  sumQuantity, 
  rank() over (
    partition by timechunk 
    order by sumQuantity desc
  ) as rank
FROM (
  select 
    toStartOfInterval(timestamp, interval 3 hour) as timechunk,
    keyword,
    sum(quantity) as sumQuantity
    from foo
    group by toStartOfInterval(timestamp, interval 3 hour), keyword
    order by 3
) 
ORDER BY 1, 4) SELECT * from raw where rank <= 5

to PRQL:

from foo
derive [timechunk = s"toStartOfInterval(timestamp, interval 3 hour)"]
group [keyword, timechunk] (
  aggregate [sumQuantity = sum quantity]
)
group [timechunk] (
  sort sumQuantity
  window expanding:true (
    derive rank = rank
  )
)
sort [timechunk, sumQuantity]
filter rank <= 5

If we had the same defaults as SQL, the PRQL could instead be:

from foo
derive [timechunk = s"toStartOfInterval(timestamp, interval 3 hour)"]
group [keyword, timechunk] (
  aggregate [sumQuantity = sum quantity]
)
group [timechunk] (
  sort sumQuantity
+  derive rank = rank
-  window expanding:true (
-    derive rank = rank
-  )
)
sort [timechunk, sumQuantity]
filter rank <= 5

max-sixty avatar Oct 22 '22 21:10 max-sixty

I think it makes sense to adopt the same defaults as SQL, since as you point out it doesn't make much sense otherwise.


On a perhaps (un)related note, I'm not a fan of the implicit window functions. I was surprised when I became of them in another thread with aljazerzen. I would rather have the window function be specified but get rid of the derive like we have for aggregate. So my proposal would be (I changed the sort to DESC since that seems to have got lost since the Discord thread and I also renamed chunkRank since I found rank = rank confusing):

from foo
derive [timechunk = s"toStartOfInterval(timestamp, interval 3 hour)"]
group [keyword, timechunk] (
  aggregate [sumQuantity = sum quantity]
)
group [timechunk] (
  sort [-sumQuantity]
  window [
    chunkRank = rank
  ]
)
sort [timechunk, sumQuantity]
filter chunkRank <= 5

I know that's not how our window works at the moment but bringing it up here again as it's a good example.

EDIT: From the playground apparently sort -sumQuantity doesn't work and it has to be sort [-sumQuantity].


Given the description of the query in English, I'm not sure you want a window function in this case though:

And the query in English is: For every three-hour bucket of timestamp, show the five most interesting keywords, where interesting is defined as the largest sum(quantity)

Actually, the more I'm looking into this I don't think I understand the current queries. I'll respond in a separate comment as it might get a bit lenghty.

snth avatar Oct 24 '22 09:10 snth

I'm trying to understand the query as per description:

And the query in English is: For every three-hour bucket of timestamp, show the five most interesting keywords, where interesting is defined as the largest sum(quantity)

Original SQL

with raw as (select 
  timechunk, 
  keyword, 
  sumQuantity, 
  rank() over (
    partition by timechunk 
    order by sumQuantity desc
  ) as rank
FROM (
  select 
    toStartOfInterval(timestamp, interval 3 hour) as timechunk,
    keyword,
    sum(quantity) as sumQuantity
    from foo
    group by toStartOfInterval(timestamp, interval 3 hour), keyword
    order by 3
) 
ORDER BY 1, 4) SELECT * from raw where rank <= 5

Factoring out the FROM clause into a separate CTE:

with grouped as (
    select 
        toStartOfInterval(timestamp, interval 3 hour) as timechunk,
        keyword,
        sum(quantity) as sumQuantity
    from foo
    group by 
        toStartOfInterval(timestamp, interval 3 hour), 
        keyword
    order by 3
),
raw as (
    select 
        timechunk, 
        keyword, 
        sumQuantity, 
        rank() over (
            partition by timechunk 
            order by sumQuantity desc
         ) as rank
    from grouped
    order by 1, 4
) 
SELECT * 
from raw 
where rank <= 5

This is somewhat tangential to the question but I think this is a good example of why I believe in PRQL we should disallow inline (sub)queries in transforms such as FROM, JOIN, UNION - it makes things very confusing and breaks the linear dataflow. Given the existence of CTEs / table blocks, I argue it's ok to force people to put their transform logic there so that the reader then has a directed dataflow.


Returning to the question, in the raw CTE, does the RANK() not need to be over all rows in the partition? If the rows come streamed in in descending order of sumQuantity then wouldn't a simple ROW_NUMBER() be enough? (I'm not sure of the details tbh as I'm one of those that gets the details of the row selections in SQL Window clauses wrong all the time.)

I'm wondering whether the following query might not solve the problem as stated and be the simplest and most idiomatic PRQL?

from foo
derive [timechunk = s"toStartOfInterval(timestamp, interval 3 hour)"]
group [keyword, timechunk] (
  aggregate [sumQuantity = sum quantity]
)
group [timechunk] (
  sort [-sumQuantity]
  take 5
)

The generated SQL for this is the following and I can't tell if it's correct or not because I can't picture in my mind how the ROW_NUMBER() window function interacts with the GROUP BY of the query that it's embedded in. Would be good if we had some sample data to test these with.

WITH table_0 AS (
  SELECT
    keyword,
    toStartOfInterval(timestamp, interval 3 hour) AS timechunk,
    SUM(quantity) AS "sumQuantity",
    ROW_NUMBER() OVER (
      PARTITION BY toStartOfInterval(timestamp, interval 3 hour)
      ORDER BY
        SUM(quantity) DESC
    ) AS _rn_84
  FROM
    foo
  GROUP BY
    keyword,
    toStartOfInterval(timestamp, interval 3 hour)
)
SELECT
  table_0.*
FROM
  table_0
WHERE
  _rn_84 <= 5

snth avatar Oct 24 '22 10:10 snth

I'm wondering whether the following query might not solve the problem as stated and be the simplest and most idiomatic PRQL?

from foo
derive [timechunk = s"toStartOfInterval(timestamp, interval 3 hour)"]
group [keyword, timechunk] (
  aggregate [sumQuantity = sum quantity]
)
group [timechunk] (
  sort [-sumQuantity]
  take 5
)

This is brilliant — @aljazerzen spends a great deal of energy building the group & take interaction, and I don't use it! Thanks @snth. I may put this in the canonical examples.

max-sixty avatar Oct 24 '22 17:10 max-sixty

On a perhaps (un)related note, I'm not a fan of the implicit window functions. I was surprised when I became of them in another thread with aljazerzen.

Yeah, I put this in the bucket of "It's better. But it's less familiar with a mental model of SQL". I'm fine opening an issue and collecting thoughts; these issues are a delicate tradeoff rather than something we can discuss our way to a solution.

FWIW your final example — using the group & take transforms — is also using implicit window functions!

max-sixty avatar Oct 24 '22 17:10 max-sixty

the SQL default might be surprising, but you literally never want the whole window when supplying ORDER BY; or you wouldn't supply the ORDER BY.

That's not true: when using LAG, you want the whole window but also want to have order. Same with RANK.

SELECT 
  id,
  value,
  LAG(value) OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
  RANK() OVER (ORDER BY value ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM test;

This works the same as it would with expanding window (which is the default), because window functions that map column -> column are not affected by range or rows at all.

So I think that being explicit is good in this case and we should stick to being predictable instead inheriting SQL defaults.

aljazerzen avatar Oct 28 '22 15:10 aljazerzen

Some notes on how I think about windowing (consider this an optional read, I want to formulate my thoughts).

There are two types of windowing:

  • actual window functions that produce the whole column at once (LAG, RANK, NTILE, ...). They map n rows into n rows by definition, which means that specifying a window frame does not make sense, since they always use and produce the whole window at once.
  • aggregation functions that are computed for each row, from multiple rows (SUM, COUNT, FIRST_VALUE). They map n values into one, which means that it does make sense to adjust window frame i.e. which rows are used when computing a row value.

side note: if an aggregation window uses full window, the result can be broadcast to all rows, since it will be the same anyways.

aljazerzen avatar Oct 28 '22 15:10 aljazerzen

Hey folks, I'm the person who posted the example that motivated this ticket, thanks for your time and effort engaging with my use case!

A new wrinkle I'm thinking about: the clients of my software also expect to see the number of "other" records that didn't have one of the top-k-ranked values. Can you think of a natural/easy/feasible way to adjust the PRQL to include that information? e.g.:

Table schema

CREATE TABLE foo (
  timestamp timestamp not null,
  colour varchar(16) not null
)

Human intent query

"Top 3 colours by day"

Desired result:

  • 2001-02-03:
    • red: 5
    • blue: 7
    • brown: 9
    • [other]: 22
  • 2001-02-03:
    • red: 6
    • blue: 9
    • brown: 12
    • [other]: 34

acruise avatar Dec 08 '22 19:12 acruise

I believe you are want to do this:

table daily_colors = (
    from foo
    group [day, colour] (aggregate cnt = count)
    group [day] (sort [-cnt] | derive rnk = row_number)
)
table top3 = (
    from daily_colors
    filter rnk <= 3
)
table others = (
    from daily_colors
    filter rnk > 3
    group [day] (aggregate cnt = count)
    derive color = '[other]'
)

# just concat the two tables together.
# we currenty don't have union implemented 
# and this is the workaround.
# yes, is's ugly
from s"SELECT * FROM top3 UNION ALL SELECT * FROM others"

And something seems off with produced SQL. The first table should not fit into one CTE...

aljazerzen avatar Dec 08 '22 19:12 aljazerzen

(This example is about an actual window function, where frame does not matter. In my query, row_number is not within a window so default is used: the whole group)

aljazerzen avatar Dec 08 '22 19:12 aljazerzen

Briefly on one point — and sorry for not having responded above

That's not true: when using LAG, you want the whole window but also want to have order. Same with RANK.

...is absolutely right and I agree with the cases (LEAD is an even better example than LAG which technically would work with a rolling windows IIUC).

And more broadly, I'm not sure there's a reasonable standard for defaults beyond having things fully explicit

max-sixty avatar Dec 08 '22 20:12 max-sixty

There are two types of windowing:

* actual window functions that produce the whole column at once (LAG, RANK, NTILE, ...). They map n rows into n rows by definition, which means that specifying a window frame does not make sense, since they always use and produce the whole window at once.

* aggregation functions that are computed for each row, from multiple rows (SUM, COUNT, FIRST_VALUE). They map n values into one, which means that it does make sense to adjust window frame i.e. which rows are used when computing a row value.

~One slight refinement — it can still be helpful to have a window function in the first set of functions. Say you want "what's the rank of my score relative to the scores I've seen before" — then you want a RANK() OVER (ORDER BY date).~ (edit below)

I agree the majority case is quite different — which I guess is why SQL has different defaults. I'm not sure how to make a standard for those — both types technically reduce n values to 1 value for each row. (which agrees with your perspective on having the same default for both)

max-sixty avatar Dec 08 '22 20:12 max-sixty

"what's the rank of my score relative to the scores I've seen before"

Well yes - you can think about passing only a rolling frame to rank, but it's implemented as an actual window function and it is not affected by frames at all. These are all equivalent:

RANK() OVER (ORDER BY date)
RANK() OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
RANK() OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
RANK() OVER (ORDER BY id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)

The last one especially does not make sense, but in SQL this doesn't even matter.

(edit: fixed CURRENT ROW)


I'm not sure there's a reasonable standard for defaults beyond having things fully explicit

I also think that. And this means that we can choose our default.

aljazerzen avatar Dec 08 '22 20:12 aljazerzen

My message above is wrong — I was thinking of ranking one column (e.g. score) in a rolling window of another (e.g. date). SQL doesn't do that.

Not sure if I'm still thinking about it wrong, but I would have thought that

RANK() OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT

...every row would be rank 1, because it's always the highest value in its frame?

Though BigQuery doesn't allow this when I attempted to test it —  Window framing clause is not allowed for analytic function rank at [4:31]. (Maybe that's what you're saying)

max-sixty avatar Dec 09 '22 00:12 max-sixty

Yes, that probably because of what I'm saying. PG simply ignores the frame, while BQ apparently throws an error. Which means we should probably throw an error too...

This thread has many tangential examples and discussions. I'll open an issue about this last thing, but we need some closure on discussion about the frame default when sorting.

aljazerzen avatar Dec 09 '22 08:12 aljazerzen

I got the query to work in Postgres, I just had to add the following line to top3:

    select [d, cnt, c]

Final:

table daily_colors = (
    from foo
    group [d, c] (aggregate cnt = count)
    group [d] (sort [-cnt] | derive rnk = row_number)
)
table top3 = (
    from daily_colors
    filter rnk <= 3
    select [d, cnt, c]
)
table others = (
    from daily_colors
    filter rnk > 3
    group [d] (aggregate cnt = count)
    derive c = '[other]'
)
# just concat the two tables together.
# we currenty don't have union implemented 
# and this is the workaround.
# yes, is's ugly
from s"SELECT * FROM top3 UNION ALL SELECT * FROM others"
sort [d, -cnt]

I also added the

sort [d, -cnt]

at the end because the results were in a random order. Interestingly this actually also produced an error because it got translated to

ORDER BY *, * DESC

because of the opaqueness of that s-string from.

Final SQL example I ran:

with foo(d, c) as (
    values (1, 'r'), (1, 'r'), (1, 'r'), (1, 'b'), (1, 'b'), (1, 'g'), (1, 'c'), (1, 'p'),
        (2, 'r'), (2, 'r'), (2, 'b'), (2, 'b'), (2, 'b'), (2, 'g'), (2, 'c')
),
daily_colors AS (
  SELECT
    d,
    c,
    COUNT(*) AS cnt,
    ROW_NUMBER() OVER (
      PARTITION BY d
      ORDER BY
        COUNT(*) DESC ROWS BETWEEN UNBOUNDED PRECEDING
        AND UNBOUNDED FOLLOWING
    ) AS rnk
  FROM
    foo
  GROUP BY
    d,
    c
),
others AS (
  SELECT
    d,
    COUNT(*) AS cnt,
    '[other]' AS c
  FROM
    daily_colors
  WHERE
    rnk > 3
  GROUP BY
    d
),
top3 AS (
  SELECT
    d,
    cnt,
    c
  FROM
    daily_colors
  WHERE
    rnk <= 3
),
table_1 AS (
  SELECT
    *
  FROM
    top3
  UNION
  ALL
  SELECT
    *
  FROM
    others
)
SELECT
  *
FROM
  table_1 AS table_0
ORDER BY
  d,
  cnt DESC

which produced

1	3	r
1	2	[other]
1	2	b
1	1	p
2	3	b
2	2	r
2	1	g
2	1	[other]

snth avatar Dec 09 '22 10:12 snth

I had the same expectation as @max-sixty that every row would be rank 1 in the example below:

Not sure if I'm still thinking about it wrong, but I would have thought that

RANK() OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT

...every row would be rank 1, because it's always the highest value in its frame?

I tested this in postgres with:

with foo(i, n) as (
    values (1, 300), (2, 50000), (3, 1),
        (4, 20), (5, 600000), (6, 4000)
)
select *
    , RANK() OVER (ORDER BY n)
    , RANK() OVER (ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as rank2
    , RANK() OVER (ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as rank3
    , RANK() OVER (ORDER BY n ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) as rank4
    , SUM(n) OVER (ORDER BY n)
    , SUM(n) OVER (ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sum2
    , SUM(n) OVER (ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as sum3
    , SUM(n) OVER (ORDER BY n ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) as sum4
from foo
order by i

which produces image

The SUM() examples behave as I expect and respect the supplied frames, so why does RANK() just ignore the frames as you said @aljazerzen ? That's very surprising and confusing.

snth avatar Dec 09 '22 11:12 snth

Because RANK computes the whole column at once while SUM computes an aggregate at each of the rows.

Imagine you'd have:

def rank(values: List(number)) -> List(number):
    return [i for i, v in enumerate(values)]

def sum(values: List(number)) -> number:
    s = 0
    for v in values:
        s += v
    return s

For implementation of window functions, you'd check if return value of the function is List or not:

  • if it is, call it only once and use output as the new column,
  • if it is not, iterate over rows and call the function once for each row. Here you can limit what you pass into the call by the window frame.

Note 1: second option is probably not implemented like this (having O(n^2)) - you can use cumsum, but this is just an illustration.

Note 2: rank could be implemented as aggregation:

def rank(values: List(number)) -> number:
    return len(values)

... but this is much less more efficient and does not hold for all such functions, such as NTILE.

Note 3: both my RANK impls are simplified, they don't handle ties. It could be that you can't implement actual RANK as aggregation.


In summary: actual window functions don't use window frame (or even reject it BQ).

So I suggest we default to no frame (that it the whole table/window/group).

aljazerzen avatar Dec 09 '22 11:12 aljazerzen

Thanks. I understand there may be performance implications and some combinations may not be feasible. My complaint is more about how this is documented and communicated. I admit I probably did SQL for 10 years before I ever even heard of WINDOW functions but I was not aware of this distinction until now. Looking over the Postgresql documentation again now, I see it is in there but IMHO it's quite subtle:

9.22. Window Functions

Note that first_value, last_value, and nth_value consider only the rows within the “window frame”, which by default contains the rows from the start of the partition through the last peer of the current row. ... ... When an aggregate function is used as a window function, it aggregates over the rows within the current row's window frame.

I guess the implication is that the rest of the functions in that list DO NOT consider the "window frame".

There's also 4.2.8. Window Function Calls which has a section

The optional frame_clause can be one of ...

I just skimmed that section now but there's nothing jumping out at me that says that the optional frame_clause only applies to some window functions.

I think maybe one way this could be made clearer would be to call first_value, last_value, and nth_value FRAME FUNCTIONS which are similar to WINDOW FUNCTIONS and use the same OVER (...) syntax with the addition that FRAME FUNCTIONS can also accept an optional frame_clause. Moreover AGGREGATE FUNCTIONS that act over a OVER (...) clause are actually FRAME FUNCTIONS.

I think that's how I'm going to try and remember it internally.


I find this a pretty tricky space with many footguns lying around. I agree with you that we should try and make this consistent in PRQL. I'll need to step away from this for a bit to think about what I think would be a better behaviour to try and keep it logical and consistent for our users.


Two related functions where I've also gotten tripped up in the past and spent hours debugging queries is with FIRST_VALUE and LAST_VALUE. I don't recall the details now but I vaguely recall that you're better off with FIRST_VALUE () OVER (ORDER BY ... DESC) than simply LAST_VALUE() OVER (ORDER BY ...) but I don't recall why. It also had something to do with the FRAME.

snth avatar Dec 09 '22 12:12 snth

So to confirm my SQL understanding, it seems there are three categories of window / "analytic" functions:

  • Aggregate analytic functions — sum, count, etc — aggregate functions with a window clause
    • By default, these use RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (edit:) if an ORDER BY exists
    • ...or the full window if there's no ORDER BY
  • Most "window functions"
    • Two types: Navigation functions such as LAG and Numbering functions such as RANK
    • In BQ, these don't accept window clauses, raising an error — they always operate over the full window
    • Postgres docs seem to imply the same default, (though by omission, and I haven't tested it)
  • _VALUE window functions — FIRST_VALUE, LAST_VALUE, NTH_VALUE
    • These do accept window clauses
    • In BQ & postgres these use the same RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW default (reference below for postgres, and I tested it myself on BQ[^1])

      Note that first_value, last_value, and nth_value consider only the rows within the “window frame”, which by default contains the rows from the start of the partition through the last peer of the current row. This is likely to give unhelpful results for last_value and sometimes also nth_value. You can redefine the frame by adding a suitable frame specification (RANGE, ROWS or GROUPS) to the OVER clause. See Section 4.2.8 for more information about frame specifications.

References:

  • Postgres: https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS
  • BQ: https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls#def_window_frame

[^1]: BQ states "You can't use a window frame clause with navigation functions and numbering functions" in their docs, and lists the _VALUE functions in Navigation functions. But it also contains examples of _VALUE functions using window clauses, and they work when testing

Is that right?

max-sixty avatar Dec 09 '22 23:12 max-sixty

That sums it up well and is my understanding as well.

snth avatar Dec 10 '22 13:12 snth

If that's the case, then I can see a few options for our defaults:

  1. Inherit SQL defaults
    • Everything works now. The defaults are arguably normally the thing folks want. But the defaults are complicated and confusing.
  2. Default to a full window
    • Until we encode the SQL defaults, some dialects won't be able to do LAG / LEAD, because they'll raise an error when we try and supply a window clause.
    • ...though maybe we could add a type which will not write a window clause, because we know the default is the full window
    • It's a very simple standard
  3. Try and come up with some better standard
    • Based on @aljazerzen 's points, I don't think this is viable — any standard that does what you want most of the time ends up being overly complicated to document and understand

I would probably vote for ~1~ 2: Full Window if we think it's possible to work around using something like types in the near future (and ofc happy to help with the work). I could also see resting on ~2~ 1: Inherit SQL Defaults for a while being reasonable, before later switching.

max-sixty avatar Dec 10 '22 21:12 max-sixty

I vote +2 on the full window (which is also the current behavior).

aljazerzen avatar Dec 23 '22 13:12 aljazerzen

(I realize my numbers didn't agree with my titles, so fixed above)

Full window was also my vote; do you think it's viable to also let dialects like BQ work with the "Most window functions" group? I think we could even just hardcode them in the compiler to not emit window clauses, assuming there isn't a cleverer way.

max-sixty avatar Dec 27 '22 02:12 max-sixty

I think we can close this as decided, please reopen for any disagreements!

max-sixty avatar Jan 05 '23 17:01 max-sixty