prql icon indicating copy to clipboard operation
prql copied to clipboard

feat: table literals

Open max-sixty opened this issue 3 years ago • 10 comments

This would complete https://github.com/prql/prql/issues/286

It currently would have syntax like:

from inline:csv """
  a,b,c
  1,2,3
  4,5,6
"""
filter [a > 2]

The functions for parsing & translating the expression are done, but I still need to allow from to take more than just a table name. I could either do that here on on top of the semantic branch. The most general approach would be to allow a general expression in the from clause, which would involve lots of changes which may overlap with semantic.

What do folks think of the prql syntax? There are alternatives discussed in #286.

max-sixty avatar Oct 10 '22 03:10 max-sixty

I've rebased and added ad-hoc impl for RQ relation literals. The test I added should be removed, it's just for testing this part of the compiler before frontend is done.

aljazerzen avatar Nov 28 '22 17:11 aljazerzen

About the frontend: I don't think that we can squeeze this behavior into from, while also having correct type definitions.

Currently, from has this signature:

func from<table> `default_db.source`<table> -> null

We would need a way of having these signatures:

func from<table> `default_db.source`<table> inline:false -> null

func from<table> `default_db.source`<string> inline:true -> null

Some languages support this, but I don't like it that much. I would rather have two different functions:

func from_csv<table> csv<string> -> null

aljazerzen avatar Nov 28 '22 17:11 aljazerzen

func from_csv<table> csv<string> -> null

I'm certainly OK with trying this. It was nice to have a very small number of transforms; I can imagine if someone wants to add some data they type from <tab> and see options for inline / csv / etc. And if we have from_csv and from_json it's going to increase cardinality.

But OTOH from_csv lexically begins with from! And you're completely right about the types; if we start treating a CSV file as a table name the autocomplete is gonna get extremely confusing. +1 on keeping the type system simple; we can always increase complexity later...

max-sixty avatar Nov 29 '22 09:11 max-sixty

(and thanks for rebasing!)

max-sixty avatar Nov 29 '22 09:11 max-sixty

How about making this a type of table definition at the top rather than a from statement?

Where I usually need this type of thing is for mapping categoricals or enums to a description so it's usually on a RHS of a join. I currently use dbt seed tables for this but these kind of inline definitions would be neat.


What are the pros and cons of implementing this via repeated SELECT vs VALUES statements? I just tried to find some online playgrounds to try this out on and it seems VALUES needs a slightly different syntax for MySQL. I don't know about BigQuery and Snowflake. Looks like VALUES might be less widely supported so SELECT is a better choice.

The following works in Postgresql and Sqlite:

SELECT *
FROM (
  VALUES 
    (1, 'large'), 
    (2, 'medium'), 
    (3, 'small')
) as t0

MySQL

SELECT *
FROM (
  VALUES 
    ROW(1, 'large'), 
    ROW(2, 'medium'), 
    ROW(3, 'small')
) as t0

snth avatar Nov 29 '22 10:11 snth

How about making this a type of table definition at the top rather than a from statement?

This would be valid with from_csv:

table tab = from_csv "a,b,c\n1,2,3"

from tab
select [a + 1, b, c]

aljazerzen avatar Nov 29 '22 11:11 aljazerzen

Apologies, I hadn't caught up on all the discussion in #286 before my previous comment.


@aljazerzen Thanks for the table example - it's succinct and ergonomic.


@max-sixty I concur your with your concern about the increase in cardinality with from_csv, from_json, from_yaml, ... . I'm also concerned about whether it won't potentially be confusing in databases like DuckDB where things like

SELECT * FROM 'test.csv';

are possible and a user might expect to have to use from_csv in the PRQL case for that?

Given that this is about reading inline data from a string, how about a name like from_inline or from_string and then making the type/kind of data a paramter? Something like the following signature

from_string kind:{csv,json,yaml,...} data

So the original example would look like:

from_string kind:csv """
  a,b,c
  1,2,3
  4,5,6
"""
filter [a > 2]

Other possible names to consider for the kind parameter would be type, format, or encoding.


The DuckDB csv file case was partially covered in #286 with the mention of a 'file://' prefix in the table/URI name. Another option to consider would be a from_file clause/transform which then could take a similar kind parameter to specify csv, json, parquet, ...

snth avatar Nov 30 '22 09:11 snth

@snth Can you copy your comment to issue #286? Let's keep PRs about implementation.

aljazerzen avatar Nov 30 '22 09:11 aljazerzen

Done

snth avatar Nov 30 '22 10:11 snth

I did a rebase again and prepared some pointers:

  1. I suggest storing the literal in ExprKind::Literal. To do this, add Literal::Relation and move rq::RelationLiteral to pl::literal

  2. Add std::from_csv<table> csv<string> -> null

  3. This function will get resolved as usual and to prevent it being evaluated to null, we can add special behavior for it here: https://github.com/prql/prql/blob/bfeb72293cfda5fca79f7db10265805c09018f12/prql-compiler/src/semantic/transforms.rs#L24-L28

    If we want resolver to be aware of the columns in the csv, we need to somehow set the Expr::ty to Ty::Table.

    So I suggest that csv is parsed at this stage and converted to RelationLiteral.

    When we have RelationLiteral, we can construct the Frame which is needed for Ty::Table. It needs to contain

    • vec of FrameColumn::Single:
      • name is {csv alias}.{column_name}. If csv expr does not have an alias, default to something, maybe "csv"?
      • expr_id has to be set to id of the Expr node that represents the frame input (which is the csv string in this case).
    • single input, which must be vec![FrameInput { }]
      • id must be same as expr_id,
      • name must be same as csv alias (or the default)
      • table is None

    If all is well, from_csv can now return Expr { kind: RelationLiteral, ty: Ty::Table, id: csv.id }.

    At this stage, cargo run debug should be working along with name resolution of csv columns downstream.

  4. Lowering from PL into RQ is very similar as table s-strings: https://github.com/prql/prql/blob/bfeb72293cfda5fca79f7db10265805c09018f12/prql-compiler/src/semantic/lowering.rs#L144-L181

    First we have to declare literal as new table and add it to Lowerer::table_buffer, and then create a local instance of that table.

    RelationLiteral is different from s-strings in declaring columns:

    • use ColumnDeclKind::ExternRef,
    • don't forget to also add a col decl ColumnDeclKind::Wildcard, because it may be needed for some queries that just select all columns.

    Lowerer::table_columns is used when a table instance is created, so don't forget to populate it with CIds you just generated.

    At this stage cargo run resolve should be working.

  5. Translating already done I think, so cargo run

    But I suggest moving sample_data::sql_of_sample_data to sql::codegen.

    We should also consider @snth's suggestion of using VALUES via sql_ast::SetExpr::Values, but we'd had to open as PR to sqlparser to fix it for MySQL (and possibly other dialects).

aljazerzen avatar Nov 30 '22 11:11 aljazerzen

Thanks for pushing this over the line, sorry I let it linger.

Does it work already? I couldn't immediately get it to work in the playground. I can add some tests and docs if so.

max-sixty avatar Jan 12 '23 18:01 max-sixty

Oh, I forgot to commit a test:

from_csv """a,b,c
1,2,3
4,5,6
"""
select [b, c]

aljazerzen avatar Jan 12 '23 19:01 aljazerzen

Sorry, in #286 I got the impression we were going to go with from_text?

snth avatar Jan 12 '23 19:01 snth

Whoops, you're right. It's easy to change though.

aljazerzen avatar Jan 12 '23 19:01 aljazerzen

Awesome @aljazerzen ! It works great

max-sixty avatar Jan 13 '23 00:01 max-sixty