prql Notation for creating sample data

One annoyance in SQL is that it's quite difficult to create some data in code. This is generally used for sample tables when demonstrating something, or for quick lookup tables.

IME it ends up being something like:

select tim as first_name, smith as last_name, 30 as age
union all
select mary, jones, 31
union all
...

I think we could make this much nicer. The simplest approach would be to just do from_json (or _csv or _yaml), parse the inline string, and then convert to the verbose SQL representation. As above, it's not for loading lots of data; it's just for small samples.

There's possibly a better approach which could include types / structs / arrays etc, though we're somewhat limited by the constraints of SQL. The thought was inspired by @jamii's post's, worth reading (& subscribing!): https://www.scattered-thoughts.net/writing/the-shape-of-data/

Mar 30 '22 22:03 max-sixty

I would rather propose we use from for this:

from data:[
  first_name: "tim",
  hobby: "HI",
  age: 30,
]
join ...

From can then be thought as "data source" either as a table, inline data or csv/json/yml

Also there has to be quotes on strings.

Apr 02 '22 11:04 aljazerzen

Great, agree with the from (/ table?) syntax.

With the more literal style, how would be represent multiple rows succinctly? It would be convenient to have a format that people can copy from somewhere else too, as opposed to a PRQL-specific format.

I had started making something like:

from csv:r#"
first_name, last_name, age
tim, smith, 30
mary, jones 31
"#

I had started with CSV since it doesn't need quotes (it wouldn't need the r#""# at first), and is more concise than JSON.

WDYT? Do you still prefer the more "PRQL-native" approach?

Apr 02 '22 21:04 max-sixty

Hmmm, why not both? PRQL has nice highlighting, CSV is copy-pastable. There are use cases for both.

Problem with CSV is that there is no standard. comma/semicolon/tab/space, header/no header, spaces after comma are all ambiguous in different software.

Before adding r#""# think twice about how will be include regex literals. Also, in your example string literal starts with new line...

Ah, so much problems with such a simple feature.

Apr 02 '22 22:04 aljazerzen

Before adding r#""# think twice about how will be include regex literals. Also, in your example string literal starts with new line...

I was thinking that r"\w" would be raw text, and the rust approach of r#"foo"# was a good way of extending this to multiple lines! WDYT?

Hmmm, why not both? PRQL has nice highlighting, CSV is copy-pastable. There are use cases for both.

Definitely agree. I do think there can be a temptation to build too many ways of doing something — it's difficult to turn ideas down! But for new users it can be confusing. Generally I would vote to do one thing and then assess if we need another.

Possibly in this case there is a real use for both though.

Apr 02 '22 23:04 max-sixty

I do appreciate you keep things clean @max-sixty , I agree with your sentiment of only building one way to do things in the beginning. More ways will come as the language grows but in the beginning if we can keep it light I think it will go far.

That said , I still like the prql approach of from data , but it looks like it would only allow for one row. How many rows do you need in this approach ?

Apr 03 '22 14:04 qharlie

How many rows do you need in this approach ?

Generally for sample data I've needed 5-10...

Apr 03 '22 15:04 max-sixty

Ah, in this case we would need nested arrays:

from data:[
  [first_name: "tim", hobby: "HI", age: 30],
  [first_name: "tim", hobby: "HI", age: 30],
  [first_name: "tim", hobby: "HI", age: 30],
]
join ...

but now, CSV looks better because it has headers at the top.

Apr 04 '22 10:04 aljazerzen

Something I've noticed that I miss from SQL as I'm writing documentation is the ability to start without a table; e.g. if we want to demonstrate an s-string, we need to have something like:

from employees
derive db_version: s"version()"

...whereas in SQL, that's just SELECT version(); no need for the from.

Hopefully as part of this we can make that easier; e.g.

from data:""
derive db_version: s"version()"

...might be sufficient

Apr 10 '22 20:04 max-sixty

but now, CSV looks better because it has headers at the top.

CSV is not a great format, but it does have this one advantage! Are there nicer formats that also can have column names specified only once?

Apr 10 '22 20:04 max-sixty

SQL has that:

INSERT INTO employees (first_name, age) VALUES
("John", 15),
("Janez", 17),
("Hans", 16)

Pandas has columns paramater of DataFrame constructor. In prql this would look like:

from columns:[first_name, age] data:[
  ["John", 15],
  ["Janez", 17],
  ["Hans", 16],
]
select ...

Apr 11 '22 07:04 aljazerzen

Also, currently this is a valid query:

derive db_version: s'version()'

SELECT
  version() AS db_version

Apr 11 '22 07:04 aljazerzen

Also, currently this is a valid query:

Ah, I didn't even realize that! That's great!

SQL has that:

INSERT INTO employees (first_name, age) VALUES
("John", 15),
("Janez", 17),
("Hans", 16)

Yes, this is very reasonable. it's a shame that this doesn't extend to inline sample data; and requires table modifications — particularly for analytical DBs where table modifications are expensive & rare.

Apr 11 '22 19:04 max-sixty

Postgres has something decent for this: https://www.postgresql.org/docs/current/queries-values.html

Apr 15 '22 02:04 max-sixty

Great. This means that we have a few notations and a way to convert that to SQL (at least for postgres).

from data:[] column:[]
from data:[column1:value, column2:value, ...  ]
from csv:"column1,column2\nvalue1,value2"

As far as I am concerned, we can go on and implement this.

Apr 15 '22 06:04 aljazerzen

Great! I am some of the way through it, I paused last week, but I can finish it off!

Apr 15 '22 19:04 max-sixty

Then there's

from datasource("file:/C:/data/employees.csv")
...

with schema("https://example.com/data/schema.json")
from datasource("https://example.com/data/employees.json")
...

(!)

Sep 24 '22 12:09 Barney-x1

Oh, that's right. pyprql had something like this to load a file directly and we should also incorporate it into the language. Although, with should probably also be an argument to from.

Sep 24 '22 17:09 aljazerzen

Yes, agree, I think the

from `file:/C:/data/employees.csv`

could work now; it compiles to:

SELECT
  "file:/C:/data/employees".csv.*
FROM
  "file:/C:/data/employees".csv

So possibly we want just SELECT * there, but it's close.

And then the implementation will be up to the engine; duckdb will work now, obv most DBs won't

The schema is a broader question which I don't think we're ready to answer yet

Sep 25 '22 01:09 max-sixty

For a generic solution csvq looks pretty good and perhaps could be added as an SQL dialect/back end? Command line tool, MIT license.

Sep 25 '22 09:09 Barney-x1

@max-sixty I concur your with your concern about the increase in cardinality with from_csv, from_json, from_yaml, ... . I'm also concerned about whether it won't potentially be confusing in databases like DuckDB where things like

SELECT * FROM 'test.csv';

are possible and a user might expect to have to use from_csv in the PRQL case for that?

Given that this is about reading inline data from a string, how about a name like from_inline or from_string and then making the type/kind of data a paramter? Something like the following signature

from_string kind:{csv,json,yaml,...} data

So the original example would look like:

from_string kind:csv """
  a,b,c
  1,2,3
  4,5,6
"""
filter [a > 2]

Other possible names to consider for the kind parameter would be type, format, or encoding.

The DuckDB csv file case was partially covered above with the mention of a 'file://' prefix in the table/URI name. Another option to consider would be a from_file clause/transform which then could take a similar kind parameter to specify csv, json, parquet, ...

Nov 30 '22 10:11 snth

That's a great point. I'm +2 on the from_string instead of from or from_csv.

I think that from_string format:csv "..." is the best signature.

And :+1: on from_file, but this if for #someday_maybe

Nov 30 '22 11:11 aljazerzen

I broadly agree. I do find "string" a bit engineer-y — but maybe "inline" isn't that much better. I'm +1 on from_string unless we can think of a friendlier alternative

Nov 30 '22 19:11 max-sixty

from_text?

Dec 01 '22 08:12 aljazerzen

from_text sounds good.

RE: from_file, I agree that that's not needed now (or ever possibly). The DuckDB case works just fine with what we have at the moment because you can simply put the file path in a quoted table name and DuckDB is perfectly happy with that and just works.

Dec 01 '22 08:12 snth

I mean ultimately, we would want to write PRQL without knowing what the engine is - so I'd want to have a common 'from_file' syntax for all databases.

Dec 01 '22 09:12 aljazerzen

This has been decided and is now matter of implementation, see #1014 for outline of how to implement.

Dec 23 '22 13:12 aljazerzen

First part (from_text function) is now done. Second part (actual relation literals) can be postponed until we formalize type system, because the syntax of relation literals will probably reflect syntax of type definitions.

By relation literals I mean this:

[[first_name: "tim", hobby: "HI", age: 30]]
select [hobby, age]

Jan 19 '23 11:01 aljazerzen

Maybe the formats should be defined in modules instead of the core? Maybe it should be possible to augment this with new data kinds from third-party modules?

Maybe someone want to from_tsv, from_ xml, from_bson, from_jsonb, from_excel, or from_rdf?

If you have from_string format:csv "..." then where format is specified as parameter (format:{format_type}) then would that format be possible to argument with new formats defined in modules that gets imported?

Feb 10 '23 20:02 vanillajonathan

@vanillajonathan This has actually been implemented and the following works in the playground:

from_text format:json '[{"a": 1 , "b":2}]'

I believe the following should work too but for some reason I keep having problems with it. Can anyone reproduce the buggy playground behaviour for the example below?

from_text format:csv 'a,b\n1,2'

Feb 10 '23 21:02 snth

I see. Maybe it should be possible to extend the formats used by from_text by importing formats from modules?

I find it rather confusing that from_text has no parenthesis, it makes me not know if its a language construct or if it is a function call. It seems from_text is defined in the std.prql file so it seems it is a function.

Feb 10 '23 21:02 vanillajonathan

prql prql copied to clipboard

Notation for creating sample data

prql
prql copied to clipboard