`range between` seems to not support `lag()`
Cross posted at Stack Overflow.
I want to get the previous value using lag(), over a partition defined with RANGE BETWEEN. I followed an example from the documentation:
WITH orders (custkey, orderdate, totalprice)
AS
(
VALUES
('cust_1', DATE '2020-10-10', 100),
('cust_2', DATE '2020-10-10', 15),
('cust_1', DATE '2020-10-15', 200),
('cust_1', DATE '2020-10-16', 240),
('cust_2', DATE '2020-12-20', 25),
('cust_1', DATE '2020-12-25', 140),
('cust_2', DATE '2021-01-01', 5)
)
SELECT *,
avg(totalprice) OVER (
PARTITION BY custkey
ORDER BY orderdate
RANGE BETWEEN INTERVAL '1' MONTH PRECEDING AND CURRENT ROW) AS past_month_avg,
lag(totalprice) OVER ( PARTITION BY custkey
ORDER BY orderdate
RANGE BETWEEN INTERVAL '1' MONTH PRECEDING AND CURRENT ROW) AS previous_total_price_within_last_month
FROM orders
It returns:
| custkey | orderdate | totalprice | past_month_avg | previous_total_price_within_last_month |
|---|---|---|---|---|
| cust_2 | 2020-10-10 | 15 | 15.0 | NULL |
| cust_2 | 2020-12-20 | 25 | 25.0 | 15 |
| cust_2 | 2021-01-01 | 5 | 15.0 | 25 |
| cust_1 | 2020-10-10 | 100 | 100.0 | NULL |
| cust_1 | 2020-10-15 | 200 | 150.0 | 100 |
| cust_1 | 2020-10-16 | 240 | 180.0 | 200 |
| cust_1 | 2020-12-25 | 140 | 140.0 | 240 |
The problem: while the calculation for past_month_avg results as expected, the previous_total_price_within_last_month result is not as expected.
Expected Output
Given that I defined a window that ranges at the last month, I expect that lag() will return null if the "previous" totalprice value is associated with an orderdate value that is out of the 1-month-back window.
| custkey | orderdate | totalprice | past_month_avg | previous_total_price_within_last_month |
|---|---|---|---|---|
| cust_2 | 2020-10-10 | 15 | 15.0 | NULL |
| cust_2 | 2020-12-20 | 25 | 25.0 | NULL* |
| cust_2 | 2021-01-01 | 5 | 15.0 | 25 |
| cust_1 | 2020-10-10 | 100 | 100.0 | NULL |
| cust_1 | 2020-10-15 | 200 | 150.0 | 100 |
| cust_1 | 2020-10-16 | 240 | 180.0 | 200 |
| cust_1 | 2020-12-25 | 140 | 140.0 | NULL* |
*expected value
Does ranges between not support lag(), or is my example actually demonstrating the expected behavior?
For functions such as lead and lag, the window frame is not supported. Per the SQL specification:
5) If <ntile function>, <lead or lag function>, <rank function type> or ROW_NUMBER
is specified, then:
a) ...
b) The window framing clause of WDX shall not be present.
The invocation should fail if the window frame is present. Currently, it's being ignored. That's a bug.
@emmansh which version of Trino are you running?
I've written a test to verify whether the window frame is supported for the lag function: link. It shows that the current master behaves as expected, returning the error Cannot specify window frame for lag function. The same applies to the lead function.
However, for functions like rank, no such error is returned - this test fails. This is probably something we can improve.
Hi, may i ask is this issue still open? can i work on this?
Hi @yuma-tietiedaxiang, I started looking at this. However, I'm waiting for a response from @emmansh (see my last comment).
@piotrrzysko I use Trino via AWS Athena. I'm not sure what Trino version it uses, but it certainly not the latest. See here why it's a problem to figure out the Trino version on AWS Athena.
I verified this, and indeed, Athena returns the results provided by @emmansh. However, the current version of Trino throws an error for the lag function when used with the window framing clause, which is in accordance with the SQL specification. There is already a test that confirms this, and in #23856, I extended it to verify the lead function. Additionally, I included other functions for which the window framing clause is illegal.