prql icon indicating copy to clipboard operation
prql copied to clipboard

Caching relations

Open Barney-x1 opened this issue 3 years ago • 4 comments
trafficstars

I couldn't see anything saying this was supported, but I might have missed something.

Selecting interim results to a temp table is extremely useful when creating complex queries, and enables reuse in sets of related queries. It also negates the need for the dreaded nested query.

I appreciate that database writes are deprioritised for PRQL (good decision!) but select into is really a query tool, not a conventional database "write".

Possible syntax would be (as the last line of a query) into new table_name. The "new" would make a distinction between this and "INSERT INTO SELECT" (insert in existing table). Whether to support

  1. non temp tables
  2. insert into existing table (syntax into table_name)

at this stage is debateable. They are not really query-support functions like insertion into a temp table, but it seems perverse not to support them if the extra work is minimal.

If permanent tables are supported (or may be supported in future), there needs to be a way to distinguish this in the into clause, either into new temp table_name or perhaps into new #table_name (that being the convention for temp tables in some SQL dialects).

There is also the question of whether to support global temporary tables... local is by far the more common use case so probably not at present.

NB: "CREATE [TEMPORARY] TABLE ... AS SELECT ..." is the syntax in Oracle, MySQL & maybe others for what is "SELECT INTO" in SqlServer.

Barney-x1 avatar Sep 14 '22 16:09 Barney-x1

Thanks @Barney-x1 !

I agree this would be useful. I don't think we have a sufficiently formed view on whether PRQL should branch into DDL.

In the meantime, one approach to allow experimentation is to extend s-strings; e.g. starting a query off with s"CREATE TABLE ... AS SELECT", which is then included in the output SQL. We'd have to think through how to handle an s-string line that's in the middle of a PRQL query (here it would just be the first & last line).

It's also possible that we delegate this to another tool — e.g. @snth is writing something that interacts with databases, and dbt-prql handles this already through dbt

max-sixty avatar Sep 14 '22 23:09 max-sixty

I am generally supportive of this proposal. At first glance it would seem that this would provide an ability to materialise the output without having to get into the weeds of DDL. It would also seem to be in the spirit of dbt because this is largely what dbt does (admittedly behind the scenes and transparently to the user so you never need more than the query functionality of PRQL).


In my db tool I actually hacked in something similar for the DuckDB backend. It would be useful to have this generally available as it would allow the data to stay server-side in the database. At the moment my tool can only materialise data on the client side.

snth avatar Sep 22 '22 20:09 snth

I think there probably lots of features which could be implemented much more easily by internally using temp tables. In fact, to do things portably some features practically require temporary tables to implement. (I think SQL is only Turing complete with writes, so you can literally do anything in SQL whatever the dialect, but not without creating tables or writing to them.)

Foreign key chasing a la kdb+ is one feature which temp tables greatly simplifies. I'd really love to see that implemented in PRQL, its pretty much the holy grail of query simplification. Are there any plans to add it?

I think temporary tables should just be regarded as local variables, not materially different from declaring a List<Employee> (say) in a general purpose language. You can write arbitrarily complex code without local variables to store interim results, with a single expression as long as your arm. But you shouldn't!

In fact, to make it clear that this is what you are doing, perhaps the syntax should be

derive interim_table
from employees
join ...
...

rather than

from employees
join ...
...
into new interim_table

Although the "into" version fits better with the pipeline model of PRQL, it's sort of odd to declare a local variable with the variable name as the last word in the expression. And derive ... from... sounds very natural. into interim_table could (as the last line) be used to insert further records into the temp table.

It occurs to me you'd also need to add a "drop" function (probably just the single word "cleanup") which drops the temporary tables, otherwise you couldn't run the query again. Or better, whenever a query references a previously used ("in scope") temp table, automagically drop it first. (You wouldn't strictly need to drop them if each query used a different mangled name for the temp table, but until you closed your connection there could be dozens of temp tables lying around consuming resources.)

Barney-x1 avatar Sep 24 '22 12:09 Barney-x1

This is a must-have, great issue report. As PRQL is oriented toward analytical queries, specifying a temporary table must be built into the language.

This is very similar to CTEs, though. The difference is that CTE is accessible from a single query and will never be written to disk (at least in Postgres).

Instead of into syntax, maybe we could have cache function?

table interim_table = cache within:session (
  from employees
  join ...
)

or equivalently:

table interim_table = (
  from employees
  join ...
  cache within:session
)

within param could handle a few options:

  • session which would translate to CRATE TEMPORARY TABLE
  • transaction (default) which would translate to CREATE TEMPORARY TABLE ON COMMIT DROP
  • query which would translate to a WITH MATERIALIZED
  • none which would translate to WITH NOT MATERIALIZED or be even inlined into FROM and JOIN clauses.

The usage of these options and cache function itself is very depended on the query executor. Right now, the compiler considers a single PRQL string as input and produces one output query, so the responsibility of knowing whether or not the TEMP TABLE already exists or if it must be recreated falls on whatever is executing the query. Right now, we are content with that and redirect anyone who wants better control toward dbt, which handles this very well.

(I did not talk about INSERT INTO SELECT, let's talk about that in a separate issue for about mutation queries)

aljazerzen avatar Oct 28 '22 13:10 aljazerzen