prql s-string causes a missing aggregate

s-string causes a missing aggregate

Open max-sixty opened this issue 2 years ago • 1 comments

func sum method -> s"sum(method)"

from x
aggregate [
    sum credit_card,
]
select credit_card

compiles to

SELECT
  credit_card
FROM
  x

...which doesn't look correct :)

May 30 '22 04:05 max-sixty

The problem here is not s-string, but how columns are named:

This:

from x
aggregate [
    sum credit_card,
]

... returns a frame of one unnamed column, which cannot be referenced (we don't support referencing columns by index). So when you add:

select credit_card

... I don't know what credit_card resolves to! It shouldn't be the unnamed aggregated column and it shouldn't be x.credit_card. It should be an error.

But I do think that applying a function to a column should keep the column name. Let's call this "variable name propagation". So these two would be equivalent:

select a = floor a
select floor a

TODO:

[x] fix the bug and make the code above return an error "variable credit_card not found"
[ ] add variable name propagation

Jun 02 '22 16:06 aljazerzen

First TODO is now fixed, but let's keep the issue open for tracking "variable name propagation".

Nov 28 '22 10:11 aljazerzen

If we were to follow how databases themselves act, shouldn't select floor a instead be equivalent to select floor = floor a?

select floor(1.2) in an actual DB returns a column named floor:

floor
1

This will return just a round column:

from x

derive precision = 2

select (round precision 0.12345)

So by my logic, the following should also work, but it doesn't:

from x

derive precision = 2

select (round precision 0.12345)
select doubled = _frame.round * 2

Dec 01 '22 12:12 mklopets

If we were to follow how databases themselves act, shouldn't select floor a instead be equivalent to select floor = floor a?

Well we have to decide how do we name columns in PRQL. Currently, names are set with name = and derived from column expressions that are just idents themselves.

There are basically two options for ... | select [round my_column]:

result is named round (what most databases do & what you suggested)
result is named my_column (which I what I meant initially)

Now, I'm not quite sure which option I prefer or if I even want any of them.

This is tangential but worth clarifying:

In the examples you've given, there should be typing errors that have not yet been implemented:

from x
derive precision = 2
select (round precision 0.12345)
              ---------
                \ first arg of function round expected int, found column<int>

Dec 01 '22 12:12 aljazerzen

Tangential indeed, but in the case of using round, why do we need an int rather than column<int>?

Dec 01 '22 15:12 mklopets

Well, because I don't think we can implement using a column for precision in ROUND in SQL.

This doesn't work:

select round(5.678, t.a) FROM (values (1), (2), (3)) AS t(a);

... except that it does, whoops. My bad.

Dec 01 '22 15:12 aljazerzen


select (round precision 0.12345)
              ---------
                \ first arg of function round expected int, found column<int>

This would indeed be great!

Tangential indeed, but in the case of using round, why do we need an int rather than column<int>?

To what extent are those equivalent? Most of the time they are — e.g. in "normal" expressions — foo + 1 or foo + bar. But not for all parameters — e.g. LAG(foo) is not valid, even though ROUND(_, foo) is.

Does that mean we're going to need to find the idiosyncratic type signature of every function to do this fully? I think we can get a lot of the way there without that level of customization, but it seems like the final 10% could be as much work as the first 90%, particularly if they're different between databases

Dec 02 '22 07:12 max-sixty

But I do think that applying a function to a column should keep the column name. Let's call this "variable name propagation". So these two would be equivalent:
select a = floor a
select floor a

Are we sure this would be clearer? A couple of arguments against it:

It's quite easy to just do select a = floor a — there's very little tax!
Possible ambiguities
- How would we handle multiple columns select foo + bar?
- What about user functions? Would the column here be named foo? func f col -> bar + col; select f foo
- How about lists? select [floor a, ceil a]?

The ambiguities are resolvable, but add some complication to the language spec, and possibly to the compiler too.

(edit: I probably should have opened a new issue; given the original case is solved, thanks to @aljazerzen . if there's more discussion here then lmk and I'm happy to do that)

Dec 02 '22 08:12 max-sixty

I agree, it's not that hard to use assigns.

The ambiguities:

for multiple columns, we wouldn't assume any name,
for functions with multiple column input, we wouldn't assume any name,
for duplicate names, the second name would overshadow first name. This would be same as we what we do now for explicit column names.

So I think it is possible to implement both option 1 and 2 without ambiguities, but I think we shouldn't. This feels like the kind of magic that comes around to bite you in the ass.

Dec 02 '22 09:12 aljazerzen

This feels like the kind of magic that comes around to bite you in the ass.

rofl, my thoughts exactly

I'll close then? No permanent verdict obv.

Dec 02 '22 09:12 max-sixty

prql prql copied to clipboard

s-string causes a missing aggregate

prql
prql copied to clipboard