prql icon indicating copy to clipboard operation
prql copied to clipboard

s-string causes a missing aggregate

Open max-sixty opened this issue 2 years ago • 1 comments

func sum method -> s"sum(method)"

from x
aggregate [
    sum credit_card,
]
select credit_card

compiles to

SELECT
  credit_card
FROM
  x

...which doesn't look correct :)

max-sixty avatar May 30 '22 04:05 max-sixty

The problem here is not s-string, but how columns are named:

This:

from x
aggregate [
    sum credit_card,
]

... returns a frame of one unnamed column, which cannot be referenced (we don't support referencing columns by index). So when you add:

select credit_card

... I don't know what credit_card resolves to! It shouldn't be the unnamed aggregated column and it shouldn't be x.credit_card. It should be an error.

But I do think that applying a function to a column should keep the column name. Let's call this "variable name propagation". So these two would be equivalent:

select a = floor a
select floor a

TODO:

  • [x] fix the bug and make the code above return an error "variable credit_card not found"
  • [ ] add variable name propagation

aljazerzen avatar Jun 02 '22 16:06 aljazerzen

First TODO is now fixed, but let's keep the issue open for tracking "variable name propagation".

aljazerzen avatar Nov 28 '22 10:11 aljazerzen

If we were to follow how databases themselves act, shouldn't select floor a instead be equivalent to select floor = floor a?

select floor(1.2) in an actual DB returns a column named floor:

floor
1

This will return just a round column:

from x

derive precision = 2

select (round precision 0.12345)

So by my logic, the following should also work, but it doesn't:

from x

derive precision = 2

select (round precision 0.12345)
select doubled = _frame.round * 2

mklopets avatar Dec 01 '22 12:12 mklopets

If we were to follow how databases themselves act, shouldn't select floor a instead be equivalent to select floor = floor a?

Well we have to decide how do we name columns in PRQL. Currently, names are set with name = and derived from column expressions that are just idents themselves.

There are basically two options for ... | select [round my_column]:

  • result is named round (what most databases do & what you suggested)
  • result is named my_column (which I what I meant initially)

Now, I'm not quite sure which option I prefer or if I even want any of them.


This is tangential but worth clarifying:

In the examples you've given, there should be typing errors that have not yet been implemented:

from x
derive precision = 2
select (round precision 0.12345)
              ---------
                \ first arg of function round expected int, found column<int>

aljazerzen avatar Dec 01 '22 12:12 aljazerzen

Tangential indeed, but in the case of using round, why do we need an int rather than column<int>?

mklopets avatar Dec 01 '22 15:12 mklopets

Well, because I don't think we can implement using a column for precision in ROUND in SQL.

This doesn't work:

select round(5.678, t.a) FROM (values (1), (2), (3)) AS t(a);

... except that it does, whoops. My bad.

aljazerzen avatar Dec 01 '22 15:12 aljazerzen


select (round precision 0.12345)
              ---------
                \ first arg of function round expected int, found column<int>

This would indeed be great!


Tangential indeed, but in the case of using round, why do we need an int rather than column<int>?

To what extent are those equivalent? Most of the time they are — e.g. in "normal" expressions — foo + 1 or foo + bar. But not for all parameters — e.g. LAG(foo) is not valid, even though ROUND(_, foo) is.

Does that mean we're going to need to find the idiosyncratic type signature of every function to do this fully? I think we can get a lot of the way there without that level of customization, but it seems like the final 10% could be as much work as the first 90%, particularly if they're different between databases

max-sixty avatar Dec 02 '22 07:12 max-sixty

But I do think that applying a function to a column should keep the column name. Let's call this "variable name propagation". So these two would be equivalent:

select a = floor a
select floor a

Are we sure this would be clearer? A couple of arguments against it:

  • It's quite easy to just do select a = floor a — there's very little tax!
  • Possible ambiguities
    • How would we handle multiple columns select foo + bar?
    • What about user functions? Would the column here be named foo? func f col -> bar + col; select f foo
    • How about lists? select [floor a, ceil a]?

The ambiguities are resolvable, but add some complication to the language spec, and possibly to the compiler too.


(edit: I probably should have opened a new issue; given the original case is solved, thanks to @aljazerzen . if there's more discussion here then lmk and I'm happy to do that)

max-sixty avatar Dec 02 '22 08:12 max-sixty

I agree, it's not that hard to use assigns.

The ambiguities:

  • for multiple columns, we wouldn't assume any name,
  • for functions with multiple column input, we wouldn't assume any name,
  • for duplicate names, the second name would overshadow first name. This would be same as we what we do now for explicit column names.

So I think it is possible to implement both option 1 and 2 without ambiguities, but I think we shouldn't. This feels like the kind of magic that comes around to bite you in the ass.

aljazerzen avatar Dec 02 '22 09:12 aljazerzen

This feels like the kind of magic that comes around to bite you in the ass.

rofl, my thoughts exactly

I'll close then? No permanent verdict obv.

max-sixty avatar Dec 02 '22 09:12 max-sixty