Should the compiler strive to evaluate expressions?
We could make the compiler smarter and have it compute some data transformations at compile time. But that would reduce the similarity between the input PRQL and the output SQL, so I'm not sure it's worth it.
There is a lot we could do with compile-time evaluation (we could basically write a Turing-complete interpreter), but simple cases start with:
derive [5 * 15 - 3]
SELECT 72
My use case would be:
let fuel_consumption = 4.8 # L per 100km
let fuel_price = 1.95 # EUR per L
from trips
derive fuel_cost = distance_km * fuel_consumption / 100 * fuel_price
SELECT distance_km * 0.0936 as fuel_cost FROM trips
Also there is this issue.
This would be possible for any value that is known at compile time, which mostly means constants and function parameters.
Now, I know that the amount of work we could offload this way would be minuscule, because the majority of the work happens when iterating over rows, which we cannot compute in advance since we don't know the contents of the rows. Furthermore, there would probably be no performance improvement on real databases, because they (probably) already evaluate scalar expressions once before applying them to all the rows.
But in the compiler, this would be very easy to do. We already have to infer the types of binary operations, so we would only have to:
- check whether an operation uses only constant operands (ones that don't reference any columns or s-strings)
- evaluate the operation
Since PRQL functions are already materialized during compilation, calls to them with constant arguments would be covered as well.
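The two steps above amount to ordinary constant folding. Here is a minimal sketch in Python, assuming a made-up expression tree: `Literal`, `Column`, `SString`, `BinOp` and `fold` are hypothetical names for illustration and are not the PRQL compiler's actual types.

```python
import operator
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Literal:      # a value known at compile time
    value: float


@dataclass(frozen=True)
class Column:       # a reference to a column, unknown until execution
    name: str


@dataclass(frozen=True)
class SString:      # raw SQL, opaque to the compiler
    sql: str


@dataclass(frozen=True)
class BinOp:
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Column, SString, BinOp]

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def fold(expr: Expr) -> Expr:
    """Evaluate every binary operation whose operands are both constants."""
    if not isinstance(expr, BinOp):
        return expr  # literals, columns and s-strings are left untouched
    left, right = fold(expr.left), fold(expr.right)
    both_constant = isinstance(left, Literal) and isinstance(right, Literal)
    # Division by zero is left in place so the database reports it.
    if both_constant and not (expr.op == "/" and right.value == 0):
        return Literal(OPS[expr.op](left.value, right.value))
    return BinOp(expr.op, left, right)


# A column on one side blocks folding of the outer operation only.
folded = fold(BinOp("*", Column("distance_km"), BinOp("*", Literal(2), Literal(3))))
print(folded)
```

One caveat with this naive version: if the fuel example is parsed left to right as `((distance_km * fuel_consumption) / 100) * fuel_price`, every operation has the column somewhere in its left operand, so nothing folds. Getting to `distance_km * 0.0936` would also require reassociating the multiplications and divisions, which is more than the two steps listed.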
But the question is not whether we can, but whether we should implement it. Arguably, having the output be similar to the input is a big help with debugging. If implemented, it must have a --no-eval flag to disable it.
The only upside I can think of is that some databases may not support all operations PRQL does, for example date arithmetic:
let today_started = @2022-06-23T04:36+02
let today_finished = @2022-06-23T11:44+01
from hikes
filter duration_min < (to_min today_finished - today_started)
SELECT * FROM hikes WHERE duration_min < 488
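The constant 488 can be checked by hand: 04:36+02 is 02:36 UTC and 11:44+01 is 10:44 UTC, so the difference is 8 h 08 min, or 488 minutes. The same check in Python, used here only to verify the arithmetic and not as a claim about how the compiler would do it:

```python
from datetime import datetime

# The two timestamps from the example, with their UTC offsets written in full.
today_started = datetime.fromisoformat("2022-06-23T04:36+02:00")
today_finished = datetime.fromisoformat("2022-06-23T11:44+01:00")

# Subtracting timezone-aware datetimes accounts for the differing offsets.
minutes = int((today_finished - today_started).total_seconds() // 60)
print(minutes)  # 488
```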
So, now that I'm done writing this, I realize that this is a fairly unnecessary feature and there is little point in investing time in it. Let's just leave it here for future generations.
I agree with both the question and conclusion — I think we should try and do the minimum possible, and offload everything else to the underlying DB. As you say — it's easier to debug — and we also compile on every keystroke, but only execute the query once, so we want to do as little as possible when compiling.
The only upside I can think of is that some databases may not support all operations PRQL does, for example date arithmetic
Yes, possibly. Though compiling into DATEDIFF is generally possible for most DBs, I think? And doing it regardless of whether the operand is a scalar or a column means the compiler doesn't need to differentiate between the two.
Let's just leave it here for future generations.
Yes, we could put this sort of principle somewhere — i.e. delegate as much to the DB as possible. Maybe we start an ARCHITECTURE.md doc — I find those quite useful, and our architecture is becoming much larger!
Maybe this would be useful if you want a REPL or a print function (#1773).