prql icon indicating copy to clipboard operation
prql copied to clipboard

windowing functions: partial:false - option to exclude incomplete window

Open jangorecki opened this issue 2 years ago • 11 comments

Hello, I would like to file a FR for additional argument to windowing functions interface that could control computation for an incomplete (partial) window. My observation is that SQL databases are generally not giving any flexibility in that matter, and always compute and return partial window. On the other hand analytical tools are generally not only giving such flexibility but most of them defaults to not returning partial window results. What behavior is desires depends on the use case. From my personal uses cases, it is definitely the latter one that was useful. My perspective is that if an analysts asks in his/her query for SMA10, then he/she wants SMA10, not an SMA9, SMA8, etc. Therefore whenever SMA10 is not possible to be computed (partial window) then NULL/missing values is expected. Asking for changing the default in prql would be insane, but an option would be highly desired, and it actually could be another advantage of PRQL over SQL. So lets look at the example

library(prqlr)
"from EuStockMarkets | select SMI | derive {rn = row_number this} | sort rn | window rolling:10 (derive {sma10 = average SMI})" |> prql_compile() |> cat()
#WITH table_0 AS (
#  SELECT
#    "SMI",
#    ROW_NUMBER() OVER () AS rn
#  FROM
#    "EuStockMarkets"
#)
#SELECT
#  "SMI",
#  rn,
#  AVG("SMI") OVER (
#    ORDER BY
#      rn ROWS BETWEEN 9 PRECEDING AND CURRENT ROW
#  ) AS sma10
#FROM
#  table_0
#ORDER BY
#  rn

This SQL query will compute SMA10, but also SMA9...SMA1 whenever 10 observations will not be available in the window.

So my FR is about extra arg, lets call it partial

window rolling:10 partial:false (derive {sma10 = average SMI})"

and the SQL generated with partial=false would need to have a subquery and case when

SELECT
  SMI,
  CASE WHEN partial=1 THEN NULL ELSE AVG(SMI) OVER (ORDER BY id) END AS sma10
FROM (
  SELECT
    SMI,
    CASE WHEN rn<10 THEN 1 ELSE 0 END AS partial
  FROM (
    SELECT
      SMI,
      ROW_NUMBER() OVER () AS rn FROM "EuStockMarkets"
  )
)

value added for prql is potentially big, just one attribute partial:false and much less typing than standard SQL way.

jangorecki avatar Oct 07 '23 09:10 jangorecki

Related discussion in duckdb can be found here: https://github.com/duckdb/duckdb/discussions/8340 In case duckdb will provide option for that, then pqrl compiler could for this target use new duckdb option

jangorecki avatar Oct 07 '23 09:10 jangorecki

Hi,

this is a good observation and something I noticed as well when answering your other question.

It's not too difficult to add this in the PRQL query but I agree with you that it might be nice for PRQL to add this as an option as you suggested.

In the meantime you could work around it with the following:

let data = [
  {id=1, value=2},
  {id=2, value=3},
  {id=3, value=4},
  {id=6, value=2},
  {id=5, value=3},
  {id=4, value=4},
  ]

from data
window rolling:3 (
  sort id
  derive {rn = row_number this, sma3 = average value}
  filter rn>=3
  select {data.*, sma3}
  )

snth avatar Oct 07 '23 14:10 snth

Your example won't exactly give what was requested in the question because it filter out rows having partial window. Question asked to just have SMA10 value NULL but still keep those rows. Could you please point me to place in the code where changes would have to happen for 1) adding partial argument to window, 2) adding handling of new argument in compiler.

jangorecki avatar Oct 07 '23 18:10 jangorecki

Ah, I misunderstood your requirement.

The following query should produce the result you want:

let data = [
  {id=1, value=2},
  {id=2, value=3},
  {id=3, value=4},
  {id=6, value=2},
  {id=5, value=3},
  {id=4, value=4},
  ]

from data
window rolling:3 (
  sort id
  derive {rn = row_number this, 
          sma3 = case {rn<3 => null, true => average value}
          }
  select {data.*, sma3}
  )

Please note: case currently still uses {} but this will change to [] in 0.10 (we can't release this currently because it's a breaking change).

I'm not the best person to ask about the compiler. @aljazerzen or @max-sixty , can you answer the following?

Could you please point me to place in the code where changes would have to happen for 1) adding partial argument to window, 2) adding handling of new argument in compiler.

snth avatar Oct 07 '23 19:10 snth

This would be super useful. But I'm not sure how it fits with current window semantics.

To implement it, we'd probably have to hack something up during resolution. Not the easiest.

aljazerzen avatar Oct 12 '23 13:10 aljazerzen

Just a link to a book that compares this: https://eitsupi.github.io/querying-with-prql/timeseries#sec-moving-ave

eitsupi avatar Oct 12 '23 13:10 eitsupi

I think the request is for something similar to pandas' min_periods, is that correct?

To develop this, the first thing to do would be to write out the SQL that we'd want to generate. I think this can probably be done with another window function which counts the number of values that are non-null, which we then return NULL for when it's below the min_periods value.

(FWIW I remember writing something a bit like this here, though that's possibly more complicated than this case)

max-sixty avatar Oct 12 '23 19:10 max-sixty

Just a link to a book that compares this: https://eitsupi.github.io/querying-with-prql/timeseries#sec-moving-ave

These are really great!!

max-sixty avatar Oct 12 '23 19:10 max-sixty

Just a link to a book that compares this: https://eitsupi.github.io/querying-with-prql/timeseries#sec-moving-ave

Very cool!

I also used an SMA like that in my presentations, as well as an EWMA which you can find in this Google Colab if you want to include that as well? https://colab.research.google.com/drive/1asOWKjQbv6VWW9WjWR2Mer7PLmm0S0dT?usp=sharing

snth avatar Oct 13 '23 05:10 snth

I think the request is for something similar to pandas' min_periods, is that correct?

To develop this, the first thing to do would be to write out the SQL that we'd want to generate. I think this can probably be done with another window function which counts the number of values that are non-null, which we then return NULL for when it's below the min_periods value.

(FWIW I remember writing something a bit like this here, though that's possibly more complicated than this case)

yes, min_period in pandas

For a window that is specified by an integer, min_periods will default to the size of the window.

while in SQL it is always 1.

The request is about adding an option for pandas default behavior.

The written SQL is in the first post, no need to count nulls there, just case row_number().

jangorecki avatar Oct 13 '23 05:10 jangorecki

The written SQL is in the first post, no need to count nulls there, just case row_number().

Yes great, thanks.

max-sixty avatar Oct 16 '23 18:10 max-sixty