prql icon indicating copy to clipboard operation
prql copied to clipboard

Deterministic/pure functions

Open aljazerzen opened this issue 2 years ago • 0 comments

In SQL, deterministic functions (called STABLE/IMMUTABLE in Postgres) guarantee return same result given same arguments.

An example of non-STABLE function would be RANDOM().

This clashes with our inlining of some column expressions:

Because columns in SELECT cannot reference columns declared in the same SELECT, you often have to repeat yourself. One of the original features of PRQL proposal were variables: a way to declare new columns that can then be used in other column expressions:

[{x = 0}, {x = 1}]
derive a = s"RANDOM()"
derive c = a + 1
filter c == a + 1

... compiles to ...

WITH table_0 AS (
  SELECT 0 AS x
  UNION ALL
  SELECT 1 AS x
)
table_1 AS (
  SELECT
    x,
    RANDOM() AS a,
    RANDOM() + 1 AS c
  FROM table_0
)
SELECT x, a, c
FROM table_1
WHERE c = a + 1

Because RANDOM() in a is inlined into c, RANDOM is invoked twice, giving two different results. This is fine and even desired as long as the function inlined is trivial and stable.

It may look like a contrived example, but there are many use-cases with generating UUIDs running into this problem.

Solution #1

Mark some functions as non-STABLE and change SQL translator not to inline them and instead compute the column in a prior CTE:

WITH table_0 AS (
  SELECT 0 AS x
  UNION ALL
  SELECT 1 AS x
),
table_1 AS (
  SELECT
    x,
    RANDOM() AS a
  FROM table_0
)
table_2 AS (
  SELECT
    x,
    a,
    a + 1 AS c
  FROM table_1
)
SELECT x, a, c
FROM table_2
WHERE c = a + 1

Solution #2

Declare all PRQL functions STABLE. For this to work we would have to contrive an alternative API to SQL's non-STABLE functions.

(this is pseudocode, because we don't have syntax for it yet)

func random seed -> (
  s"SETSEED({seed})";
  s"RANDOM()"
)

from x
derive [
  a = random 0.42,
  b = a - 6,
  c = a,
]
filter a == c

I'm not sure how my function would get translated, maybe something like:

(SELECT SETSEED(0.42)) UNION ALL (SELECT RANDOM()) OFFSET 1

but the important part is that this can then get inlined as much a we want; the result will always be the same.

This is obviously more complicated, but I feel like it's the more "pure functional" way. I know I'm biased toward such patterns, but it does have some pros, like avoiding the whole STABLE-non-STABLE function flags which then crawl along functions all the way trough compiler.

But, looking at the SETSEED sub query, this would have terrible performance if it got inlined many times. Which means that we would also need some kind of optimizer that would prevent inlining complex expressions.


This is not the priority right now, as it can easily be worked around with explicit CTEs:

table x_with_a = (
  from x
  derive [a = random 0.42]
)

from x_with_a
derive [
  b = a - 6,
  c = a,
]
filter a == c

Edit: update to latest PRQL syntax

aljazerzen avatar Nov 05 '22 09:11 aljazerzen