prql feat: table literals

This would complete https://github.com/prql/prql/issues/286

It currently would have syntax like:

from inline:csv """
  a,b,c
  1,2,3
  4,5,6
"""
filter [a > 2]

The functions for parsing & translating the expression are done, but I still need to allow from to take more than just a table name. I could either do that here on on top of the semantic branch. The most general approach would be to allow a general expression in the from clause, which would involve lots of changes which may overlap with semantic.

What do folks think of the prql syntax? There are alternatives discussed in #286.

Oct 10 '22 03:10 max-sixty

I've rebased and added ad-hoc impl for RQ relation literals. The test I added should be removed, it's just for testing this part of the compiler before frontend is done.

Nov 28 '22 17:11 aljazerzen

About the frontend: I don't think that we can squeeze this behavior into from, while also having correct type definitions.

Currently, from has this signature:

func from<table> `default_db.source`<table> -> null

We would need a way of having these signatures:

func from<table> `default_db.source`<table> inline:false -> null

func from<table> `default_db.source`<string> inline:true -> null

Some languages support this, but I don't like it that much. I would rather have two different functions:

func from_csv<table> csv<string> -> null

Nov 28 '22 17:11 aljazerzen

func from_csv<table> csv<string> -> null

I'm certainly OK with trying this. It was nice to have a very small number of transforms; I can imagine if someone wants to add some data they type from <tab> and see options for inline / csv / etc. And if we have from_csv and from_json it's going to increase cardinality.

But OTOH from_csv lexically begins with from! And you're completely right about the types; if we start treating a CSV file as a table name the autocomplete is gonna get extremely confusing. +1 on keeping the type system simple; we can always increase complexity later...

Nov 29 '22 09:11 max-sixty

(and thanks for rebasing!)

Nov 29 '22 09:11 max-sixty

How about making this a type of table definition at the top rather than a from statement?

Where I usually need this type of thing is for mapping categoricals or enums to a description so it's usually on a RHS of a join. I currently use dbt seed tables for this but these kind of inline definitions would be neat.

What are the pros and cons of implementing this via repeated SELECT vs VALUES statements? I just tried to find some online playgrounds to try this out on and it seems VALUES needs a slightly different syntax for MySQL. I don't know about BigQuery and Snowflake. Looks like VALUES might be less widely supported so SELECT is a better choice.

The following works in Postgresql and Sqlite:

SELECT *
FROM (
  VALUES 
    (1, 'large'), 
    (2, 'medium'), 
    (3, 'small')
) as t0

MySQL

SELECT *
FROM (
  VALUES 
    ROW(1, 'large'), 
    ROW(2, 'medium'), 
    ROW(3, 'small')
) as t0

Nov 29 '22 10:11 snth

How about making this a type of table definition at the top rather than a from statement?

This would be valid with from_csv:

table tab = from_csv "a,b,c\n1,2,3"

from tab
select [a + 1, b, c]

Nov 29 '22 11:11 aljazerzen

Apologies, I hadn't caught up on all the discussion in #286 before my previous comment.

@aljazerzen Thanks for the table example - it's succinct and ergonomic.

@max-sixty I concur your with your concern about the increase in cardinality with from_csv, from_json, from_yaml, ... . I'm also concerned about whether it won't potentially be confusing in databases like DuckDB where things like

SELECT * FROM 'test.csv';

are possible and a user might expect to have to use from_csv in the PRQL case for that?

Given that this is about reading inline data from a string, how about a name like from_inline or from_string and then making the type/kind of data a paramter? Something like the following signature

from_string kind:{csv,json,yaml,...} data

So the original example would look like:

from_string kind:csv """
  a,b,c
  1,2,3
  4,5,6
"""
filter [a > 2]

Other possible names to consider for the kind parameter would be type, format, or encoding.

The DuckDB csv file case was partially covered in #286 with the mention of a 'file://' prefix in the table/URI name. Another option to consider would be a from_file clause/transform which then could take a similar kind parameter to specify csv, json, parquet, ...

Nov 30 '22 09:11 snth

@snth Can you copy your comment to issue #286? Let's keep PRs about implementation.

Nov 30 '22 09:11 aljazerzen

Done

Nov 30 '22 10:11 snth

I did a rebase again and prepared some pointers:

I suggest storing the literal in ExprKind::Literal. To do this, add Literal::Relation and move rq::RelationLiteral to pl::literal
Add std::from_csv<table> csv<string> -> null
This function will get resolved as usual and to prevent it being evaluated to null, we can add special behavior for it here: https://github.com/prql/prql/blob/bfeb72293cfda5fca79f7db10265805c09018f12/prql-compiler/src/semantic/transforms.rs#L24-L28

If we want resolver to be aware of the columns in the csv, we need to somehow set the Expr::ty to Ty::Table.

So I suggest that csv is parsed at this stage and converted to RelationLiteral.

When we have RelationLiteral, we can construct the Frame which is needed for Ty::Table. It needs to contain
- vec of FrameColumn::Single:
  - name is {csv alias}.{column_name}. If csv expr does not have an alias, default to something, maybe "csv"?
  - expr_id has to be set to id of the Expr node that represents the frame input (which is the csv string in this case).
- single input, which must be vec![FrameInput { }]
  - id must be same as expr_id,
  - name must be same as csv alias (or the default)
  - table is None
If all is well, from_csv can now return Expr { kind: RelationLiteral, ty: Ty::Table, id: csv.id }.

At this stage, cargo run debug should be working along with name resolution of csv columns downstream.
Lowering from PL into RQ is very similar as table s-strings: https://github.com/prql/prql/blob/bfeb72293cfda5fca79f7db10265805c09018f12/prql-compiler/src/semantic/lowering.rs#L144-L181

First we have to declare literal as new table and add it to Lowerer::table_buffer, and then create a local instance of that table.

RelationLiteral is different from s-strings in declaring columns:
- use ColumnDeclKind::ExternRef,
- don't forget to also add a col decl ColumnDeclKind::Wildcard, because it may be needed for some queries that just select all columns.
Lowerer::table_columns is used when a table instance is created, so don't forget to populate it with CIds you just generated.

At this stage cargo run resolve should be working.
Translating already done I think, so cargo run

But I suggest moving sample_data::sql_of_sample_data to sql::codegen.

We should also consider @snth's suggestion of using VALUES via sql_ast::SetExpr::Values, but we'd had to open as PR to sqlparser to fix it for MySQL (and possibly other dialects).

Nov 30 '22 11:11 aljazerzen

Thanks for pushing this over the line, sorry I let it linger.

Does it work already? I couldn't immediately get it to work in the playground. I can add some tests and docs if so.

Jan 12 '23 18:01 max-sixty

Oh, I forgot to commit a test:

from_csv """a,b,c
1,2,3
4,5,6
"""
select [b, c]

Jan 12 '23 19:01 aljazerzen

Sorry, in #286 I got the impression we were going to go with from_text?

Jan 12 '23 19:01 snth

Whoops, you're right. It's easy to change though.

Jan 12 '23 19:01 aljazerzen

Awesome @aljazerzen ! It works great

Jan 13 '23 00:01 max-sixty

prql prql copied to clipboard

feat: table literals

prql
prql copied to clipboard