prql icon indicating copy to clipboard operation
prql copied to clipboard

window function could use local sort

Open jangorecki opened this issue 2 years ago • 6 comments

What's up?

Having the following data d

  id value
1  1     2
2  2     3
3  3     4
4  6     2
5  5     3
6  4     4

How can I generate following SQL

SELECT
  *,
  AVG(value) OVER (
    ORDER BY
      id ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS sma3
FROM
  d

According to documentation it seems that I have to sort whole dataset rather than an input to windowing function, therefore query that is being generated by prql compiler includes an extra unwanted ORDER BY id at the end.

library(prqlr)
d = data.frame(id=c(1,2,3,6,5,4), value=c(2,3,4,2,3,4))
sql = prql_compile("from d | sort id | window rolling:3 (derive {sma3 = average value})")
cat(sql)
#SELECT
#  *,
#  AVG(value) OVER (
#    ORDER BY
#      id ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
#  ) AS sma3
#FROM
#  d
#ORDER BY
#  id

How can I define sorting locally for each window function call locally? so I don't have to clutter my SQL query with unwanted clauses.

jangorecki avatar Oct 07 '23 09:10 jangorecki

Hi @jangorecki,

Great to see that you are trying out PRQL and posting questions.

You can put the sort inside the window pipeline.

So for your example:

library(prqlr)
d = data.frame(id=c(1,2,3,6,5,4), value=c(2,3,4,2,3,4))
sql = prql_compile("from d | window rolling:3 (sort id | derive {sma3 = average value})")
cat(sql)

or something that you can execude in the playground:

[{id=1, value=2},
{id=2, value=3},
{id=3, value=4},
{id=6, value=2},
{id=5, value=3},
{id=4, value=4},]
window rolling:3 (
  sort id
  derive {sma3 = average value}
  )

which produces

WITH table_0 AS (
  ...
)
SELECT
  id,
  value,
  AVG(value) OVER (
    ORDER BY
      id ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS sma3
FROM
  table_0
ORDER BY
  id

-- Generated by PRQL compiler version:0.9.5 (https://prql-lang.org)

snth avatar Oct 07 '23 14:10 snth

Thanks for reply, that make sense, could be mentioned in documentation. Although doesn't seem to address my problem. Not sure if you noticed but your example still produces extra, unneeded, order by at the end. @snth

jangorecki avatar Oct 07 '23 15:10 jangorecki

I agree we could elide the ORDER BY at the end. (i wouldn't prioritize it that highly, but it is a good point)

(Thanks for the excellent reply @snth!)

max-sixty avatar Oct 07 '23 19:10 max-sixty

I reopened the issue because that ORDER BY at the end could potentially cause an issue.

Putting a sort after the window does correctly override the ORDER BY id. However putting the sort before the window produces SQL that I believe is incorrect so I think this is a bug.

PRQL Input

[{id=1, value=2},
{id=2, value=3},
{id=3, value=4},
{id=6, value=2},
{id=5, value=3},
{id=4, value=4},]
sort value
window rolling:3 (
  sort id
  derive {sma3 = average value}
  )

SQL Output

WITH table_0 AS (
  SELECT
    1 AS id,
    2 AS value
  UNION
  ALL
  SELECT
    2 AS id,
    3 AS value
  UNION
  ALL
  SELECT
    3 AS id,
    4 AS value
  UNION
  ALL
  SELECT
    6 AS id,
    2 AS value
  UNION
  ALL
  SELECT
    5 AS id,
    3 AS value
  UNION
  ALL
  SELECT
    4 AS id,
    4 AS value
)
SELECT
  id,
  value,
  AVG(value) OVER (
    ORDER BY
      id ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS sma3
FROM
  table_0
ORDER BY
  id

-- Generated by PRQL compiler version:0.9.5 (https://prql-lang.org)

Expected SQL Output

WITH table_0 AS (
  SELECT
    1 AS id,
    2 AS value
  UNION
  ALL
  SELECT
    2 AS id,
    3 AS value
  UNION
  ALL
  SELECT
    3 AS id,
    4 AS value
  UNION
  ALL
  SELECT
    6 AS id,
    2 AS value
  UNION
  ALL
  SELECT
    5 AS id,
    3 AS value
  UNION
  ALL
  SELECT
    4 AS id,
    4 AS value
)
SELECT
  id,
  value,
  AVG(value) OVER (
    ORDER BY
      id ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS sma3
FROM
  table_0
ORDER BY
  value

-- Generated by PRQL compiler version:0.9.5 (https://prql-lang.org)

snth avatar Oct 07 '23 19:10 snth

(for anyone comparing the queries — it's the final term that is incorrect — id vs value — agree this seems to be a bug)

max-sixty avatar Oct 13 '23 05:10 max-sixty

Agree, it is a bug. Just to note that this request asks for generating windowing function having local sort, and at the same time not bloating query with extra global sort, which is completely redundant.

Documentation for presenting use of local sort was already resolved in #3634

jangorecki avatar Oct 13 '23 08:10 jangorecki