prql
prql copied to clipboard
feat: table literals
This would complete https://github.com/prql/prql/issues/286
It currently would have syntax like:
from inline:csv """
a,b,c
1,2,3
4,5,6
"""
filter [a > 2]
The functions for parsing & translating the expression are done, but I still need to allow from to take more than just a table name. I could either do that here on on top of the semantic branch. The most general approach would be to allow a general expression in the from clause, which would involve lots of changes which may overlap with semantic.
What do folks think of the prql syntax? There are alternatives discussed in #286.
I've rebased and added ad-hoc impl for RQ relation literals. The test I added should be removed, it's just for testing this part of the compiler before frontend is done.
About the frontend: I don't think that we can squeeze this behavior into from, while also having correct type definitions.
Currently, from has this signature:
func from<table> `default_db.source`<table> -> null
We would need a way of having these signatures:
func from<table> `default_db.source`<table> inline:false -> null
func from<table> `default_db.source`<string> inline:true -> null
Some languages support this, but I don't like it that much. I would rather have two different functions:
func from_csv<table> csv<string> -> null
func from_csv<table> csv<string> -> null
I'm certainly OK with trying this. It was nice to have a very small number of transforms; I can imagine if someone wants to add some data they type from <tab> and see options for inline / csv / etc. And if we have from_csv and from_json it's going to increase cardinality.
But OTOH from_csv lexically begins with from! And you're completely right about the types; if we start treating a CSV file as a table name the autocomplete is gonna get extremely confusing. +1 on keeping the type system simple; we can always increase complexity later...
(and thanks for rebasing!)
How about making this a type of table definition at the top rather than a from statement?
Where I usually need this type of thing is for mapping categoricals or enums to a description so it's usually on a RHS of a join. I currently use dbt seed tables for this but these kind of inline definitions would be neat.
What are the pros and cons of implementing this via repeated SELECT vs VALUES statements? I just tried to find some online playgrounds to try this out on and it seems VALUES needs a slightly different syntax for MySQL. I don't know about BigQuery and Snowflake. Looks like VALUES might be less widely supported so SELECT is a better choice.
The following works in Postgresql and Sqlite:
SELECT *
FROM (
VALUES
(1, 'large'),
(2, 'medium'),
(3, 'small')
) as t0
MySQL
SELECT *
FROM (
VALUES
ROW(1, 'large'),
ROW(2, 'medium'),
ROW(3, 'small')
) as t0
How about making this a type of
tabledefinition at the top rather than afromstatement?
This would be valid with from_csv:
table tab = from_csv "a,b,c\n1,2,3"
from tab
select [a + 1, b, c]
Apologies, I hadn't caught up on all the discussion in #286 before my previous comment.
@aljazerzen Thanks for the table example - it's succinct and ergonomic.
@max-sixty I concur your with your concern about the increase in cardinality with from_csv, from_json, from_yaml, ... . I'm also concerned about whether it won't potentially be confusing in databases like DuckDB where things like
SELECT * FROM 'test.csv';
are possible and a user might expect to have to use from_csv in the PRQL case for that?
Given that this is about reading inline data from a string, how about a name like from_inline or from_string and then making the type/kind of data a paramter? Something like the following signature
from_string kind:{csv,json,yaml,...} data
So the original example would look like:
from_string kind:csv """
a,b,c
1,2,3
4,5,6
"""
filter [a > 2]
Other possible names to consider for the kind parameter would be type, format, or encoding.
The DuckDB csv file case was partially covered in #286 with the mention of a 'file://' prefix in the table/URI name. Another option to consider would be a from_file clause/transform which then could take a similar kind parameter to specify csv, json, parquet, ...
@snth Can you copy your comment to issue #286? Let's keep PRs about implementation.
Done
I did a rebase again and prepared some pointers:
-
I suggest storing the literal in
ExprKind::Literal. To do this, addLiteral::Relationand moverq::RelationLiteraltopl::literal -
Add
std::from_csv<table> csv<string> -> null -
This function will get resolved as usual and to prevent it being evaluated to null, we can add special behavior for it here: https://github.com/prql/prql/blob/bfeb72293cfda5fca79f7db10265805c09018f12/prql-compiler/src/semantic/transforms.rs#L24-L28
If we want resolver to be aware of the columns in the csv, we need to somehow set the
Expr::tytoTy::Table.So I suggest that csv is parsed at this stage and converted to
RelationLiteral.When we have
RelationLiteral, we can construct theFramewhich is needed forTy::Table. It needs to contain- vec of
FrameColumn::Single:nameis{csv alias}.{column_name}. If csv expr does not have an alias, default to something, maybe "csv"?expr_idhas to be set to id of the Expr node that represents the frame input (which is the csv string in this case).
- single input, which must be
vec![FrameInput { }]idmust be same asexpr_id,namemust be same as csv alias (or the default)tableis None
If all is well,
from_csvcan now returnExpr { kind: RelationLiteral, ty: Ty::Table, id: csv.id }.At this stage,
cargo run debugshould be working along with name resolution of csv columns downstream. - vec of
-
Lowering from PL into RQ is very similar as table s-strings: https://github.com/prql/prql/blob/bfeb72293cfda5fca79f7db10265805c09018f12/prql-compiler/src/semantic/lowering.rs#L144-L181
First we have to declare literal as new table and add it to
Lowerer::table_buffer, and then create a local instance of that table.RelationLiteral is different from s-strings in declaring columns:
- use
ColumnDeclKind::ExternRef, - don't forget to also add a col decl
ColumnDeclKind::Wildcard, because it may be needed for some queries that just select all columns.
Lowerer::table_columnsis used when a table instance is created, so don't forget to populate it with CIds you just generated.At this stage
cargo run resolveshould be working. - use
-
Translating already done I think, so
cargo runBut I suggest moving
sample_data::sql_of_sample_datatosql::codegen.We should also consider @snth's suggestion of using VALUES via
sql_ast::SetExpr::Values, but we'd had to open as PR to sqlparser to fix it for MySQL (and possibly other dialects).
Thanks for pushing this over the line, sorry I let it linger.
Does it work already? I couldn't immediately get it to work in the playground. I can add some tests and docs if so.
Oh, I forgot to commit a test:
from_csv """a,b,c
1,2,3
4,5,6
"""
select [b, c]
Sorry, in #286 I got the impression we were going to go with from_text?
Whoops, you're right. It's easy to change though.
Awesome @aljazerzen ! It works great