prql icon indicating copy to clipboard operation
prql copied to clipboard

Column renaming for the duration of a function call

Open aljazerzen opened this issue 1 year ago • 9 comments

@snth has recently raised an issue that I remember we have already talked about, but cannot find now. I do think is a pain point and something we should do do better.

While on the topic of functions, one thing that I think is really needed is a way to alias columns in a relation passed to functions. I've thought a bit about this and I don't really know of an elegant way to do that.

Take for example a simple take_smallest function I use in my presentations:

let take_smallest = func n tbl -> (
  from tbl
  sort bytes
  take n
)

from tracks
group {album_id} (
  take_smallest 3
)

This works but bytes is hardcoded and you would really want that to be a parameter of take_smallest.

One workaround would be the following

let take_smallest = func n tbl -> (
  # Requires a _sort_by column to be set
  from tbl
  sort _sort_by
  take n
)

from tracks
group {album_id} (
  derive _sort_by=milliseconds
  take_smallest 3
  select !{_sort_by}
)

That works but you kinda want that derive _sort_by= ... select !{_sort_by} to happen implicitly inside the function definition.

aljazerzen avatar Jul 10 '24 15:07 aljazerzen

I think that a functional language would solve it like this:

let take_smallest = func n key_getter tbl -> (
  from tbl
  sort (key_getter this)
  take n
)

from tracks
group {album_id} (
  take_smallest 3 (func this -> this.milliseconds)
)

This currently fails miserably as functions are not really first-class citizens of PRQL.

aljazerzen avatar Jul 10 '24 15:07 aljazerzen

What about just passing the column as an argument to the function? This seems to work fine for me...

let take_smallest = func n <int> c <scalar> tbl <relation> -> (
  from tbl
  sort c
  take n
)

from tracks
group {album_id} (
  take_smallest 3 bytes
)

Compiled SQL:

WITH table_0 AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY album_id
      ORDER BY
        bytes
    ) AS _expr_0
  FROM
    tracks
)
SELECT
  *
FROM
  table_0
WHERE
  _expr_0 <= 3

-- Generated by PRQL compiler version:0.12.2 (https://prql-lang.org)

Interestingly, if you omit the <relation> type specification from the function definition, you get a weird "Unknown name" error...

kgutwin avatar Jul 10 '24 15:07 kgutwin

Thanks @kgutwin. I didn't know that could work. What does the <scalar> type represent, here or in the compiler?

Interestingly, I found that the <scalar> type annotation isn't necessary but the <relation> one is.

My minimal working example is:

let take_smallest = func n c tbl<relation> -> (
  from tbl
  sort c
  take n
)

from tracks
group {album_id} (
  take_smallest 3 bytes
)

🚀

snth avatar Jul 11 '24 09:07 snth

This is very cool! There seem to be quite a few corner cases though.

First a fun little example to play around with:

let f = func n c tbl<relation> -> (from tbl | sort c | take n | select c)

let t = (from [{a=1},{a=2},{a=3}] | derive b=-a)

from t
f 1 a     # works

produces

[
  {"a": 1}
]

We can also rename the column, including to c with

let f = func n c tbl<relation> -> (from tbl | sort c | take n | select {c=c})

from t
f 1 b

producing

[
  {"c": -3},
  {"c": -2}
]

So far, so good.

However

let f = func n c tbl<relation> -> (from tbl | sort c | take n | select {c, c=c})

from t
f 1 b

produces

[
  {"c": -3, "c": -3}
]

and SQL

WITH table_0 AS (
  SELECT
    1 AS a
  UNION
  ALL
  SELECT
    2 AS a
  UNION
  ALL
  SELECT
    3 AS a
),
t AS (
  SELECT
    a,
    - a AS b
  FROM
    table_0
)
SELECT
  b AS c,
  b AS c
FROM
  t
ORDER BY
  c
LIMIT
  1

-- Generated by PRQL compiler version:0.12.2 (https://prql-lang.org)

I think the expected output would be:

[
  {"b": -3, "c": -3}
]

and SQL

...
SELECT
  b,
  b AS c
FROM
  t
ORDER BY
  c
LIMIT
  1

snth avatar Jul 11 '24 09:07 snth

I tried to see how far I can push this, for example with joins.

Given the following simple setup, would it be possible to parameterise the relation and column being joined?

let t = (from [{a=1},{a=2},{a=3}] | derive b=-a)
let u = (from t | select {a=a+1, b=b+1})

from t
join u (==b)

Unfortunately none of the following worked for me:

Simple:

let j = func s<relation> c r<relation> -> (from r | join s (==c))

from t
j u a

One level of indirection:

let j = func s<relation> c r<relation> -> (
  from r
  derive {_i=c}
  join (from s | derive {_i=c}) (==_i)
)

from t
j u a

Explicit:

let j = func s<relation> c r<relation> -> (
  from r
  derive {_i=c}
  join (from s | derive {_i=c}) (this._i==that._i)
)

from t
j u a

They all produce the following error (which is why I tried that last example):

Error: 
    ╭─[:28:5]
    │
 28 │ j u a
    │     ┬  
    │     ╰── Ambiguous name
    │ 
    │ Help: could be any of: that.u.a, this.t.a
    │ 
    │ Note: available columns: t.a, t.b, u.a, u.b
────╯

snth avatar Jul 11 '24 10:07 snth

Here's an interesting behavior (I was experimenting a bit with some of your examples):

from t
select {b, d=b}

produces:

SELECT
  b,
  b AS d
FROM
  t

But:

let g = func c tbl <relation> -> (
  from tbl
  select {c, d=c}
)
from t
g b

produces:

SELECT
  b AS d,
  b AS d
FROM
  t

That seems like it would be a bug. I would expect the second example to produce the same output as the first.

kgutwin avatar Jul 11 '24 12:07 kgutwin

Oh god, what have we created!

The "passing column into the function" is not something that was designed purposefully - it's an artifact of how resolver is implemented, so it might change in the future. And also, it is not even fully correct as you two have found on many examples :D

aljazerzen avatar Jul 11 '24 13:07 aljazerzen

I'd argue in favor of keeping this sort of behavior (well, fixing the bugs and inconsistencies, of course). It feels really straightforward to pass a column reference to a user-created function, just like you pass column references to standard library functions.

kgutwin avatar Jul 11 '24 13:07 kgutwin

That's the thing: we don't have a clear idea what are references in PRQL; std functions are taking arguments by value (at least in my understanding). We'd need to formalize this notion of column references (that don't have a type), but are just names that the function is supposed to use.

aljazerzen avatar Jul 11 '24 20:07 aljazerzen