prql
prql copied to clipboard
s-string causes a missing aggregate
func sum method -> s"sum(method)"
from x
aggregate [
sum credit_card,
]
select credit_card
compiles to
SELECT
credit_card
FROM
x
...which doesn't look correct :)
The problem here is not s-string, but how columns are named:
This:
from x
aggregate [
sum credit_card,
]
... returns a frame of one unnamed column, which cannot be referenced (we don't support referencing columns by index). So when you add:
select credit_card
... I don't know what credit_card
resolves to! It shouldn't be the unnamed aggregated column and it shouldn't be x.credit_card. It should be an error.
But I do think that applying a function to a column should keep the column name. Let's call this "variable name propagation". So these two would be equivalent:
select a = floor a
select floor a
TODO:
- [x] fix the bug and make the code above return an error "variable credit_card not found"
- [ ] add variable name propagation
First TODO is now fixed, but let's keep the issue open for tracking "variable name propagation".
If we were to follow how databases themselves act, shouldn't select floor a
instead be equivalent to select floor = floor a
?
select floor(1.2)
in an actual DB returns a column named floor
:
floor |
---|
1 |
This will return just a round
column:
from x
derive precision = 2
select (round precision 0.12345)
So by my logic, the following should also work, but it doesn't:
from x
derive precision = 2
select (round precision 0.12345)
select doubled = _frame.round * 2
If we were to follow how databases themselves act, shouldn't select floor a instead be equivalent to select floor = floor a?
Well we have to decide how do we name columns in PRQL. Currently, names are set with name =
and derived from column expressions that are just idents themselves.
There are basically two options for ... | select [round my_column]
:
- result is named
round
(what most databases do & what you suggested) - result is named
my_column
(which I what I meant initially)
Now, I'm not quite sure which option I prefer or if I even want any of them.
This is tangential but worth clarifying:
In the examples you've given, there should be typing errors that have not yet been implemented:
from x
derive precision = 2
select (round precision 0.12345)
---------
\ first arg of function round expected int, found column<int>
Tangential indeed, but in the case of using round
, why do we need an int
rather than column<int>
?
Well, because I don't think we can implement using a column for precision in ROUND in SQL.
This doesn't work:
select round(5.678, t.a) FROM (values (1), (2), (3)) AS t(a);
... except that it does, whoops. My bad.
select (round precision 0.12345) --------- \ first arg of function round expected int, found column<int>
This would indeed be great!
Tangential indeed, but in the case of using
round
, why do we need anint
rather thancolumn<int>
?
To what extent are those equivalent? Most of the time they are — e.g. in "normal" expressions — foo + 1
or foo + bar
. But not for all parameters — e.g. LAG(foo)
is not valid, even though ROUND(_, foo)
is.
Does that mean we're going to need to find the idiosyncratic type signature of every function to do this fully? I think we can get a lot of the way there without that level of customization, but it seems like the final 10% could be as much work as the first 90%, particularly if they're different between databases
But I do think that applying a function to a column should keep the column name. Let's call this "variable name propagation". So these two would be equivalent:
select a = floor a select floor a
Are we sure this would be clearer? A couple of arguments against it:
- It's quite easy to just do
select a = floor a
— there's very little tax! - Possible ambiguities
- How would we handle multiple columns
select foo + bar
? - What about user functions? Would the column here be named
foo
?func f col -> bar + col; select f foo
- How about lists?
select [floor a, ceil a]
?
- How would we handle multiple columns
The ambiguities are resolvable, but add some complication to the language spec, and possibly to the compiler too.
(edit: I probably should have opened a new issue; given the original case is solved, thanks to @aljazerzen . if there's more discussion here then lmk and I'm happy to do that)
I agree, it's not that hard to use assigns.
The ambiguities:
- for multiple columns, we wouldn't assume any name,
- for functions with multiple column input, we wouldn't assume any name,
- for duplicate names, the second name would overshadow first name. This would be same as we what we do now for explicit column names.
So I think it is possible to implement both option 1 and 2 without ambiguities, but I think we shouldn't. This feels like the kind of magic that comes around to bite you in the ass.
This feels like the kind of magic that comes around to bite you in the ass.
rofl, my thoughts exactly
I'll close then? No permanent verdict obv.