prql
prql copied to clipboard
Should window default to `expanding` with a `sort`?
Currently we default to a window clause of including the whole window — aka rows:..
SQL's default window definition changes depending on whether there's an ORDER BY, between:
- expanding — aka
rows:..0orexpanding:truewhen anORDER BYexists - or the whole window — aka
rows:..when noORDER BYexists
So we force that to be explicit. I'm not that confident we're doing the right thing here; the SQL default might be surprising, but you literally never want the whole window when supplying ORDER BY; or you wouldn't supply the ORDER BY.
For context, this came up in a Discord discussion around this example:
with raw as (select
timechunk,
keyword,
sumQuantity,
rank() over (
partition by timechunk
order by sumQuantity desc
) as rank
FROM (
select
toStartOfInterval(timestamp, interval 3 hour) as timechunk,
keyword,
sum(quantity) as sumQuantity
from foo
group by toStartOfInterval(timestamp, interval 3 hour), keyword
order by 3
)
ORDER BY 1, 4) SELECT * from raw where rank <= 5
to PRQL:
from foo
derive [timechunk = s"toStartOfInterval(timestamp, interval 3 hour)"]
group [keyword, timechunk] (
aggregate [sumQuantity = sum quantity]
)
group [timechunk] (
sort sumQuantity
window expanding:true (
derive rank = rank
)
)
sort [timechunk, sumQuantity]
filter rank <= 5
If we had the same defaults as SQL, the PRQL could instead be:
from foo
derive [timechunk = s"toStartOfInterval(timestamp, interval 3 hour)"]
group [keyword, timechunk] (
aggregate [sumQuantity = sum quantity]
)
group [timechunk] (
sort sumQuantity
+ derive rank = rank
- window expanding:true (
- derive rank = rank
- )
)
sort [timechunk, sumQuantity]
filter rank <= 5
I think it makes sense to adopt the same defaults as SQL, since as you point out it doesn't make much sense otherwise.
On a perhaps (un)related note, I'm not a fan of the implicit window functions. I was surprised when I became of them in another thread with aljazerzen. I would rather have the window function be specified but get rid of the derive like we have for aggregate. So my proposal would be (I changed the sort to DESC since that seems to have got lost since the Discord thread and I also renamed chunkRank since I found rank = rank confusing):
from foo
derive [timechunk = s"toStartOfInterval(timestamp, interval 3 hour)"]
group [keyword, timechunk] (
aggregate [sumQuantity = sum quantity]
)
group [timechunk] (
sort [-sumQuantity]
window [
chunkRank = rank
]
)
sort [timechunk, sumQuantity]
filter chunkRank <= 5
I know that's not how our window works at the moment but bringing it up here again as it's a good example.
EDIT: From the playground apparently sort -sumQuantity doesn't work and it has to be sort [-sumQuantity].
Given the description of the query in English, I'm not sure you want a window function in this case though:
And the query in English is: For every three-hour bucket of timestamp, show the five most interesting keywords, where interesting is defined as the largest sum(quantity)
Actually, the more I'm looking into this I don't think I understand the current queries. I'll respond in a separate comment as it might get a bit lenghty.
I'm trying to understand the query as per description:
And the query in English is: For every three-hour bucket of timestamp, show the five most interesting keywords, where interesting is defined as the largest sum(quantity)
Original SQL
with raw as (select
timechunk,
keyword,
sumQuantity,
rank() over (
partition by timechunk
order by sumQuantity desc
) as rank
FROM (
select
toStartOfInterval(timestamp, interval 3 hour) as timechunk,
keyword,
sum(quantity) as sumQuantity
from foo
group by toStartOfInterval(timestamp, interval 3 hour), keyword
order by 3
)
ORDER BY 1, 4) SELECT * from raw where rank <= 5
Factoring out the FROM clause into a separate CTE:
with grouped as (
select
toStartOfInterval(timestamp, interval 3 hour) as timechunk,
keyword,
sum(quantity) as sumQuantity
from foo
group by
toStartOfInterval(timestamp, interval 3 hour),
keyword
order by 3
),
raw as (
select
timechunk,
keyword,
sumQuantity,
rank() over (
partition by timechunk
order by sumQuantity desc
) as rank
from grouped
order by 1, 4
)
SELECT *
from raw
where rank <= 5
This is somewhat tangential to the question but I think this is a good example of why I believe in PRQL we should disallow inline (sub)queries in transforms such as FROM, JOIN, UNION - it makes things very confusing and breaks the linear dataflow. Given the existence of CTEs / table blocks, I argue it's ok to force people to put their transform logic there so that the reader then has a directed dataflow.
Returning to the question, in the raw CTE, does the RANK() not need to be over all rows in the partition? If the rows come streamed in in descending order of sumQuantity then wouldn't a simple ROW_NUMBER() be enough? (I'm not sure of the details tbh as I'm one of those that gets the details of the row selections in SQL Window clauses wrong all the time.)
I'm wondering whether the following query might not solve the problem as stated and be the simplest and most idiomatic PRQL?
from foo
derive [timechunk = s"toStartOfInterval(timestamp, interval 3 hour)"]
group [keyword, timechunk] (
aggregate [sumQuantity = sum quantity]
)
group [timechunk] (
sort [-sumQuantity]
take 5
)
The generated SQL for this is the following and I can't tell if it's correct or not because I can't picture in my mind how the ROW_NUMBER() window function interacts with the GROUP BY of the query that it's embedded in. Would be good if we had some sample data to test these with.
WITH table_0 AS (
SELECT
keyword,
toStartOfInterval(timestamp, interval 3 hour) AS timechunk,
SUM(quantity) AS "sumQuantity",
ROW_NUMBER() OVER (
PARTITION BY toStartOfInterval(timestamp, interval 3 hour)
ORDER BY
SUM(quantity) DESC
) AS _rn_84
FROM
foo
GROUP BY
keyword,
toStartOfInterval(timestamp, interval 3 hour)
)
SELECT
table_0.*
FROM
table_0
WHERE
_rn_84 <= 5
I'm wondering whether the following query might not solve the problem as stated and be the simplest and most idiomatic PRQL?
from foo derive [timechunk = s"toStartOfInterval(timestamp, interval 3 hour)"] group [keyword, timechunk] ( aggregate [sumQuantity = sum quantity] ) group [timechunk] ( sort [-sumQuantity] take 5 )
This is brilliant — @aljazerzen spends a great deal of energy building the group & take interaction, and I don't use it! Thanks @snth. I may put this in the canonical examples.
On a perhaps (un)related note, I'm not a fan of the implicit window functions. I was surprised when I became of them in another thread with aljazerzen.
Yeah, I put this in the bucket of "It's better. But it's less familiar with a mental model of SQL". I'm fine opening an issue and collecting thoughts; these issues are a delicate tradeoff rather than something we can discuss our way to a solution.
FWIW your final example — using the group & take transforms — is also using implicit window functions!
the SQL default might be surprising, but you literally never want the whole window when supplying ORDER BY; or you wouldn't supply the ORDER BY.
That's not true: when using LAG, you want the whole window but also want to have order. Same with RANK.
SELECT
id,
value,
LAG(value) OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
RANK() OVER (ORDER BY value ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM test;
This works the same as it would with expanding window (which is the default), because window functions that map column -> column are not affected by range or rows at all.
So I think that being explicit is good in this case and we should stick to being predictable instead inheriting SQL defaults.
Some notes on how I think about windowing (consider this an optional read, I want to formulate my thoughts).
There are two types of windowing:
- actual window functions that produce the whole column at once (LAG, RANK, NTILE, ...). They map n rows into n rows by definition, which means that specifying a window frame does not make sense, since they always use and produce the whole window at once.
- aggregation functions that are computed for each row, from multiple rows (SUM, COUNT, FIRST_VALUE). They map n values into one, which means that it does make sense to adjust window frame i.e. which rows are used when computing a row value.
side note: if an aggregation window uses full window, the result can be broadcast to all rows, since it will be the same anyways.
Hey folks, I'm the person who posted the example that motivated this ticket, thanks for your time and effort engaging with my use case!
A new wrinkle I'm thinking about: the clients of my software also expect to see the number of "other" records that didn't have one of the top-k-ranked values. Can you think of a natural/easy/feasible way to adjust the PRQL to include that information? e.g.:
Table schema
CREATE TABLE foo (
timestamp timestamp not null,
colour varchar(16) not null
)
Human intent query
"Top 3 colours by day"
Desired result:
- 2001-02-03:
- red: 5
- blue: 7
- brown: 9
- [other]: 22
- 2001-02-03:
- red: 6
- blue: 9
- brown: 12
- [other]: 34
I believe you are want to do this:
table daily_colors = (
from foo
group [day, colour] (aggregate cnt = count)
group [day] (sort [-cnt] | derive rnk = row_number)
)
table top3 = (
from daily_colors
filter rnk <= 3
)
table others = (
from daily_colors
filter rnk > 3
group [day] (aggregate cnt = count)
derive color = '[other]'
)
# just concat the two tables together.
# we currenty don't have union implemented
# and this is the workaround.
# yes, is's ugly
from s"SELECT * FROM top3 UNION ALL SELECT * FROM others"
And something seems off with produced SQL. The first table should not fit into one CTE...
(This example is about an actual window function, where frame does not matter. In my query, row_number is not within a window so default is used: the whole group)
Briefly on one point — and sorry for not having responded above
That's not true: when using LAG, you want the whole window but also want to have order. Same with RANK.
...is absolutely right and I agree with the cases (LEAD is an even better example than LAG which technically would work with a rolling windows IIUC).
And more broadly, I'm not sure there's a reasonable standard for defaults beyond having things fully explicit
There are two types of windowing:
* actual window functions that produce the whole column at once (LAG, RANK, NTILE, ...). They map n rows into n rows by definition, which means that specifying a window frame does not make sense, since they always use and produce the whole window at once. * aggregation functions that are computed for each row, from multiple rows (SUM, COUNT, FIRST_VALUE). They map n values into one, which means that it does make sense to adjust window frame i.e. which rows are used when computing a row value.
~One slight refinement — it can still be helpful to have a window function in the first set of functions. Say you want "what's the rank of my score relative to the scores I've seen before" — then you want a RANK() OVER (ORDER BY date).~ (edit below)
I agree the majority case is quite different — which I guess is why SQL has different defaults. I'm not sure how to make a standard for those — both types technically reduce n values to 1 value for each row. (which agrees with your perspective on having the same default for both)
"what's the rank of my score relative to the scores I've seen before"
Well yes - you can think about passing only a rolling frame to rank, but it's implemented as an actual window function and it is not affected by frames at all. These are all equivalent:
RANK() OVER (ORDER BY date)
RANK() OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
RANK() OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
RANK() OVER (ORDER BY id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
The last one especially does not make sense, but in SQL this doesn't even matter.
(edit: fixed CURRENT ROW)
I'm not sure there's a reasonable standard for defaults beyond having things fully explicit
I also think that. And this means that we can choose our default.
My message above is wrong — I was thinking of ranking one column (e.g. score) in a rolling window of another (e.g. date). SQL doesn't do that.
Not sure if I'm still thinking about it wrong, but I would have thought that
RANK() OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT
...every row would be rank 1, because it's always the highest value in its frame?
Though BigQuery doesn't allow this when I attempted to test it — Window framing clause is not allowed for analytic function rank at [4:31]. (Maybe that's what you're saying)
Yes, that probably because of what I'm saying. PG simply ignores the frame, while BQ apparently throws an error. Which means we should probably throw an error too...
This thread has many tangential examples and discussions. I'll open an issue about this last thing, but we need some closure on discussion about the frame default when sorting.
I got the query to work in Postgres, I just had to add the following line to top3:
select [d, cnt, c]
Final:
table daily_colors = (
from foo
group [d, c] (aggregate cnt = count)
group [d] (sort [-cnt] | derive rnk = row_number)
)
table top3 = (
from daily_colors
filter rnk <= 3
select [d, cnt, c]
)
table others = (
from daily_colors
filter rnk > 3
group [d] (aggregate cnt = count)
derive c = '[other]'
)
# just concat the two tables together.
# we currenty don't have union implemented
# and this is the workaround.
# yes, is's ugly
from s"SELECT * FROM top3 UNION ALL SELECT * FROM others"
sort [d, -cnt]
I also added the
sort [d, -cnt]
at the end because the results were in a random order. Interestingly this actually also produced an error because it got translated to
ORDER BY *, * DESC
because of the opaqueness of that s-string from.
Final SQL example I ran:
with foo(d, c) as (
values (1, 'r'), (1, 'r'), (1, 'r'), (1, 'b'), (1, 'b'), (1, 'g'), (1, 'c'), (1, 'p'),
(2, 'r'), (2, 'r'), (2, 'b'), (2, 'b'), (2, 'b'), (2, 'g'), (2, 'c')
),
daily_colors AS (
SELECT
d,
c,
COUNT(*) AS cnt,
ROW_NUMBER() OVER (
PARTITION BY d
ORDER BY
COUNT(*) DESC ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING
) AS rnk
FROM
foo
GROUP BY
d,
c
),
others AS (
SELECT
d,
COUNT(*) AS cnt,
'[other]' AS c
FROM
daily_colors
WHERE
rnk > 3
GROUP BY
d
),
top3 AS (
SELECT
d,
cnt,
c
FROM
daily_colors
WHERE
rnk <= 3
),
table_1 AS (
SELECT
*
FROM
top3
UNION
ALL
SELECT
*
FROM
others
)
SELECT
*
FROM
table_1 AS table_0
ORDER BY
d,
cnt DESC
which produced
1 3 r
1 2 [other]
1 2 b
1 1 p
2 3 b
2 2 r
2 1 g
2 1 [other]
I had the same expectation as @max-sixty that every row would be rank 1 in the example below:
Not sure if I'm still thinking about it wrong, but I would have thought that
RANK() OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT...every row would be rank 1, because it's always the highest value in its frame?
I tested this in postgres with:
with foo(i, n) as (
values (1, 300), (2, 50000), (3, 1),
(4, 20), (5, 600000), (6, 4000)
)
select *
, RANK() OVER (ORDER BY n)
, RANK() OVER (ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as rank2
, RANK() OVER (ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as rank3
, RANK() OVER (ORDER BY n ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) as rank4
, SUM(n) OVER (ORDER BY n)
, SUM(n) OVER (ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sum2
, SUM(n) OVER (ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as sum3
, SUM(n) OVER (ORDER BY n ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) as sum4
from foo
order by i
which produces

The SUM() examples behave as I expect and respect the supplied frames, so why does RANK() just ignore the frames as you said @aljazerzen ? That's very surprising and confusing.
Because RANK computes the whole column at once while SUM computes an aggregate at each of the rows.
Imagine you'd have:
def rank(values: List(number)) -> List(number):
return [i for i, v in enumerate(values)]
def sum(values: List(number)) -> number:
s = 0
for v in values:
s += v
return s
For implementation of window functions, you'd check if return value of the function is List or not:
- if it is, call it only once and use output as the new column,
- if it is not, iterate over rows and call the function once for each row. Here you can limit what you pass into the call by the window frame.
Note 1: second option is probably not implemented like this (having O(n^2)) - you can use cumsum, but this is just an illustration.
Note 2: rank could be implemented as aggregation:
def rank(values: List(number)) -> number:
return len(values)
... but this is much less more efficient and does not hold for all such functions, such as NTILE.
Note 3: both my RANK impls are simplified, they don't handle ties. It could be that you can't implement actual RANK as aggregation.
In summary: actual window functions don't use window frame (or even reject it BQ).
So I suggest we default to no frame (that it the whole table/window/group).
Thanks. I understand there may be performance implications and some combinations may not be feasible. My complaint is more about how this is documented and communicated. I admit I probably did SQL for 10 years before I ever even heard of WINDOW functions but I was not aware of this distinction until now. Looking over the Postgresql documentation again now, I see it is in there but IMHO it's quite subtle:
Note that first_value, last_value, and nth_value consider only the rows within the “window frame”, which by default contains the rows from the start of the partition through the last peer of the current row. ... ... When an aggregate function is used as a window function, it aggregates over the rows within the current row's window frame.
I guess the implication is that the rest of the functions in that list DO NOT consider the "window frame".
There's also 4.2.8. Window Function Calls which has a section
The optional frame_clause can be one of ...
I just skimmed that section now but there's nothing jumping out at me that says that the optional frame_clause only applies to some window functions.
I think maybe one way this could be made clearer would be to call first_value, last_value, and nth_value FRAME FUNCTIONS which are similar to WINDOW FUNCTIONS and use the same OVER (...) syntax with the addition that FRAME FUNCTIONS can also accept an optional frame_clause. Moreover AGGREGATE FUNCTIONS that act over a OVER (...) clause are actually FRAME FUNCTIONS.
I think that's how I'm going to try and remember it internally.
I find this a pretty tricky space with many footguns lying around. I agree with you that we should try and make this consistent in PRQL. I'll need to step away from this for a bit to think about what I think would be a better behaviour to try and keep it logical and consistent for our users.
Two related functions where I've also gotten tripped up in the past and spent hours debugging queries is with FIRST_VALUE and LAST_VALUE. I don't recall the details now but I vaguely recall that you're better off with FIRST_VALUE () OVER (ORDER BY ... DESC) than simply LAST_VALUE() OVER (ORDER BY ...) but I don't recall why. It also had something to do with the FRAME.
So to confirm my SQL understanding, it seems there are three categories of window / "analytic" functions:
- Aggregate analytic functions — sum, count, etc — aggregate functions with a window clause
- By default, these use
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW(edit:) if anORDER BYexists - ...or the full window if there's no
ORDER BY
- By default, these use
- Most "window functions"
- Two types: Navigation functions such as
LAGand Numbering functions such asRANK - In BQ, these don't accept window clauses, raising an error — they always operate over the full window
- Postgres docs seem to imply the same default, (though by omission, and I haven't tested it)
- Two types: Navigation functions such as
_VALUEwindow functions —FIRST_VALUE,LAST_VALUE,NTH_VALUE- These do accept window clauses
- In BQ & postgres these use the same
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROWdefault (reference below for postgres, and I tested it myself on BQ[^1])Note that first_value, last_value, and nth_value consider only the rows within the “window frame”, which by default contains the rows from the start of the partition through the last peer of the current row. This is likely to give unhelpful results for last_value and sometimes also nth_value. You can redefine the frame by adding a suitable frame specification (RANGE, ROWS or GROUPS) to the OVER clause. See Section 4.2.8 for more information about frame specifications.
References:
- Postgres: https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS
- BQ: https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls#def_window_frame
[^1]: BQ states "You can't use a window frame clause with navigation functions and numbering functions" in their docs, and lists the _VALUE functions in Navigation functions. But it also contains examples of _VALUE functions using window clauses, and they work when testing
Is that right?
That sums it up well and is my understanding as well.
If that's the case, then I can see a few options for our defaults:
- Inherit SQL defaults
- Everything works now. The defaults are arguably normally the thing folks want. But the defaults are complicated and confusing.
- Default to a full window
- Until we encode the SQL defaults, some dialects won't be able to do LAG / LEAD, because they'll raise an error when we try and supply a window clause.
- ...though maybe we could add a type which will not write a window clause, because we know the default is the full window
- It's a very simple standard
- Try and come up with some better standard
- Based on @aljazerzen 's points, I don't think this is viable — any standard that does what you want most of the time ends up being overly complicated to document and understand
I would probably vote for ~1~ 2: Full Window if we think it's possible to work around using something like types in the near future (and ofc happy to help with the work). I could also see resting on ~2~ 1: Inherit SQL Defaults for a while being reasonable, before later switching.
I vote +2 on the full window (which is also the current behavior).
(I realize my numbers didn't agree with my titles, so fixed above)
Full window was also my vote; do you think it's viable to also let dialects like BQ work with the "Most window functions" group? I think we could even just hardcode them in the compiler to not emit window clauses, assuming there isn't a cleverer way.
I think we can close this as decided, please reopen for any disagreements!