prql
prql copied to clipboard
Caching relations
I couldn't see anything saying this was supported, but I might have missed something.
Selecting interim results to a temp table is extremely useful when creating complex queries, and enables reuse in sets of related queries. It also negates the need for the dreaded nested query.
I appreciate that database writes are deprioritised for PRQL (good decision!) but select into is really a query tool, not a conventional database "write".
Possible syntax would be (as the last line of a query) into new table_name. The "new" would make a distinction between this and "INSERT INTO SELECT" (insert in existing table). Whether to support
- non temp tables
- insert into existing table (syntax
into table_name)
at this stage is debateable. They are not really query-support functions like insertion into a temp table, but it seems perverse not to support them if the extra work is minimal.
If permanent tables are supported (or may be supported in future), there needs to be a way to distinguish this in the into clause, either into new temp table_name or perhaps into new #table_name (that being the convention for temp tables in some SQL dialects).
There is also the question of whether to support global temporary tables... local is by far the more common use case so probably not at present.
NB: "CREATE [TEMPORARY] TABLE ... AS SELECT ..." is the syntax in Oracle, MySQL & maybe others for what is "SELECT INTO" in SqlServer.
Thanks @Barney-x1 !
I agree this would be useful. I don't think we have a sufficiently formed view on whether PRQL should branch into DDL.
In the meantime, one approach to allow experimentation is to extend s-strings; e.g. starting a query off with s"CREATE TABLE ... AS SELECT", which is then included in the output SQL. We'd have to think through how to handle an s-string line that's in the middle of a PRQL query (here it would just be the first & last line).
It's also possible that we delegate this to another tool — e.g. @snth is writing something that interacts with databases, and dbt-prql handles this already through dbt
I am generally supportive of this proposal. At first glance it would seem that this would provide an ability to materialise the output without having to get into the weeds of DDL. It would also seem to be in the spirit of dbt because this is largely what dbt does (admittedly behind the scenes and transparently to the user so you never need more than the query functionality of PRQL).
In my db tool I actually hacked in something similar for the DuckDB backend. It would be useful to have this generally available as it would allow the data to stay server-side in the database. At the moment my tool can only materialise data on the client side.
I think there probably lots of features which could be implemented much more easily by internally using temp tables. In fact, to do things portably some features practically require temporary tables to implement. (I think SQL is only Turing complete with writes, so you can literally do anything in SQL whatever the dialect, but not without creating tables or writing to them.)
Foreign key chasing a la kdb+ is one feature which temp tables greatly simplifies. I'd really love to see that implemented in PRQL, its pretty much the holy grail of query simplification. Are there any plans to add it?
I think temporary tables should just be regarded as local variables, not materially different from declaring a List<Employee> (say) in a general purpose language. You can write arbitrarily complex code without local variables to store interim results, with a single expression as long as your arm. But you shouldn't!
In fact, to make it clear that this is what you are doing, perhaps the syntax should be
derive interim_table
from employees
join ...
...
rather than
from employees
join ...
...
into new interim_table
Although the "into" version fits better with the pipeline model of PRQL, it's sort of odd to declare a local variable with the variable name as the last word in the expression. And derive ... from... sounds very natural. into interim_table could (as the last line) be used to insert further records into the temp table.
It occurs to me you'd also need to add a "drop" function (probably just the single word "cleanup") which drops the temporary tables, otherwise you couldn't run the query again. Or better, whenever a query references a previously used ("in scope") temp table, automagically drop it first. (You wouldn't strictly need to drop them if each query used a different mangled name for the temp table, but until you closed your connection there could be dozens of temp tables lying around consuming resources.)
This is a must-have, great issue report. As PRQL is oriented toward analytical queries, specifying a temporary table must be built into the language.
This is very similar to CTEs, though. The difference is that CTE is accessible from a single query and will never be written to disk (at least in Postgres).
Instead of into syntax, maybe we could have cache function?
table interim_table = cache within:session (
from employees
join ...
)
or equivalently:
table interim_table = (
from employees
join ...
cache within:session
)
within param could handle a few options:
sessionwhich would translate toCRATE TEMPORARY TABLEtransaction(default) which would translate toCREATE TEMPORARY TABLE ON COMMIT DROPquerywhich would translate to aWITH MATERIALIZEDnonewhich would translate toWITH NOT MATERIALIZEDor be even inlined into FROM and JOIN clauses.
The usage of these options and cache function itself is very depended on the query executor. Right now, the compiler considers a single PRQL string as input and produces one output query, so the responsibility of knowing whether or not the TEMP TABLE already exists or if it must be recreated falls on whatever is executing the query. Right now, we are content with that and redirect anyone who wants better control toward dbt, which handles this very well.
(I did not talk about INSERT INTO SELECT, let's talk about that in a separate issue for about mutation queries)