prql
prql copied to clipboard
Notation for creating sample data
One annoyance in SQL is that it's quite difficult to create some data in code. This is generally used for sample tables when demonstrating something, or for quick lookup tables.
IME it ends up being something like:
select tim as first_name, smith as last_name, 30 as age
union all
select mary, jones, 31
union all
...
I think we could make this much nicer. The simplest approach would be to just do from_json (or _csv or _yaml), parse the inline string, and then convert to the verbose SQL representation. As above, it's not for loading lots of data; it's just for small samples.
There's possibly a better approach which could include types / structs / arrays etc, though we're somewhat limited by the constraints of SQL. The thought was inspired by @jamii's post's, worth reading (& subscribing!): https://www.scattered-thoughts.net/writing/the-shape-of-data/
I would rather propose we use from for this:
from data:[
first_name: "tim",
hobby: "HI",
age: 30,
]
join ...
From can then be thought as "data source" either as a table, inline data or csv/json/yml
Also there has to be quotes on strings.
Great, agree with the from (/ table?) syntax.
With the more literal style, how would be represent multiple rows succinctly? It would be convenient to have a format that people can copy from somewhere else too, as opposed to a PRQL-specific format.
I had started making something like:
from csv:r#"
first_name, last_name, age
tim, smith, 30
mary, jones 31
"#
I had started with CSV since it doesn't need quotes (it wouldn't need the r#""# at first), and is more concise than JSON.
WDYT? Do you still prefer the more "PRQL-native" approach?
Hmmm, why not both? PRQL has nice highlighting, CSV is copy-pastable. There are use cases for both.
Problem with CSV is that there is no standard. comma/semicolon/tab/space, header/no header, spaces after comma are all ambiguous in different software.
Before adding r#""# think twice about how will be include regex literals. Also, in your example string literal starts with new line...
Ah, so much problems with such a simple feature.
Before adding r#""# think twice about how will be include regex literals. Also, in your example string literal starts with new line...
I was thinking that r"\w" would be raw text, and the rust approach of r#"foo"# was a good way of extending this to multiple lines! WDYT?
Hmmm, why not both? PRQL has nice highlighting, CSV is copy-pastable. There are use cases for both.
Definitely agree. I do think there can be a temptation to build too many ways of doing something — it's difficult to turn ideas down! But for new users it can be confusing. Generally I would vote to do one thing and then assess if we need another.
Possibly in this case there is a real use for both though.
I do appreciate you keep things clean @max-sixty , I agree with your sentiment of only building one way to do things in the beginning. More ways will come as the language grows but in the beginning if we can keep it light I think it will go far.
That said , I still like the prql approach of from data , but it looks like it would only allow for one row. How many rows do you need in this approach ?
How many rows do you need in this approach ?
Generally for sample data I've needed 5-10...
Ah, in this case we would need nested arrays:
from data:[
[first_name: "tim", hobby: "HI", age: 30],
[first_name: "tim", hobby: "HI", age: 30],
[first_name: "tim", hobby: "HI", age: 30],
]
join ...
but now, CSV looks better because it has headers at the top.
Something I've noticed that I miss from SQL as I'm writing documentation is the ability to start without a table; e.g. if we want to demonstrate an s-string, we need to have something like:
from employees
derive db_version: s"version()"
...whereas in SQL, that's just SELECT version(); no need for the from.
Hopefully as part of this we can make that easier; e.g.
from data:""
derive db_version: s"version()"
...might be sufficient
but now, CSV looks better because it has headers at the top.
CSV is not a great format, but it does have this one advantage! Are there nicer formats that also can have column names specified only once?
SQL has that:
INSERT INTO employees (first_name, age) VALUES
("John", 15),
("Janez", 17),
("Hans", 16)
Pandas has columns paramater of DataFrame constructor. In prql this would look like:
from columns:[first_name, age] data:[
["John", 15],
["Janez", 17],
["Hans", 16],
]
select ...
Also, currently this is a valid query:
derive db_version: s'version()'
SELECT
version() AS db_version
Also, currently this is a valid query:
Ah, I didn't even realize that! That's great!
SQL has that:
INSERT INTO employees (first_name, age) VALUES ("John", 15), ("Janez", 17), ("Hans", 16)
Yes, this is very reasonable. it's a shame that this doesn't extend to inline sample data; and requires table modifications — particularly for analytical DBs where table modifications are expensive & rare.
Postgres has something decent for this: https://www.postgresql.org/docs/current/queries-values.html
Great. This means that we have a few notations and a way to convert that to SQL (at least for postgres).
from data:[] column:[]
from data:[column1:value, column2:value, ... ]
from csv:"column1,column2\nvalue1,value2"
As far as I am concerned, we can go on and implement this.
Great! I am some of the way through it, I paused last week, but I can finish it off!
Then there's
from datasource("file:/C:/data/employees.csv")
...
with schema("https://example.com/data/schema.json")
from datasource("https://example.com/data/employees.json")
...
(!)
Oh, that's right. pyprql had something like this to load a file directly and we should also incorporate it into the language. Although, with should probably also be an argument to from.
Yes, agree, I think the
from `file:/C:/data/employees.csv`
could work now; it compiles to:
SELECT
"file:/C:/data/employees".csv.*
FROM
"file:/C:/data/employees".csv
So possibly we want just SELECT * there, but it's close.
And then the implementation will be up to the engine; duckdb will work now, obv most DBs won't
The schema is a broader question which I don't think we're ready to answer yet
For a generic solution csvq looks pretty good and perhaps could be added as an SQL dialect/back end? Command line tool, MIT license.
@max-sixty I concur your with your concern about the increase in cardinality with from_csv, from_json, from_yaml, ... . I'm also concerned about whether it won't potentially be confusing in databases like DuckDB where things like
SELECT * FROM 'test.csv';
are possible and a user might expect to have to use from_csv in the PRQL case for that?
Given that this is about reading inline data from a string, how about a name like from_inline or from_string and then making the type/kind of data a paramter? Something like the following signature
from_string kind:{csv,json,yaml,...} data
So the original example would look like:
from_string kind:csv """
a,b,c
1,2,3
4,5,6
"""
filter [a > 2]
Other possible names to consider for the kind parameter would be type, format, or encoding.
The DuckDB csv file case was partially covered above with the mention of a 'file://' prefix in the table/URI name. Another option to consider would be a from_file clause/transform which then could take a similar kind parameter to specify csv, json, parquet, ...
That's a great point. I'm +2 on the from_string instead of from or from_csv.
I think that from_string format:csv "..." is the best signature.
And :+1: on from_file, but this if for #someday_maybe
I broadly agree. I do find "string" a bit engineer-y — but maybe "inline" isn't that much better. I'm +1 on from_string unless we can think of a friendlier alternative
from_text?
from_text sounds good.
RE: from_file, I agree that that's not needed now (or ever possibly). The DuckDB case works just fine with what we have at the moment because you can simply put the file path in a quoted table name and DuckDB is perfectly happy with that and just works.
I mean ultimately, we would want to write PRQL without knowing what the engine is - so I'd want to have a common 'from_file' syntax for all databases.
This has been decided and is now matter of implementation, see #1014 for outline of how to implement.
First part (from_text function) is now done. Second part (actual relation literals) can be postponed until we formalize type system, because the syntax of relation literals will probably reflect syntax of type definitions.
By relation literals I mean this:
[[first_name: "tim", hobby: "HI", age: 30]]
select [hobby, age]
Maybe the formats should be defined in modules instead of the core? Maybe it should be possible to augment this with new data kinds from third-party modules?
Maybe someone want to from_tsv, from_ xml, from_bson, from_jsonb, from_excel, or from_rdf?
If you have from_string format:csv "..." then where format is specified as parameter (format:{format_type}) then would that format be possible to argument with new formats defined in modules that gets imported?
@vanillajonathan This has actually been implemented and the following works in the playground:
from_text format:json '[{"a": 1 , "b":2}]'
I believe the following should work too but for some reason I keep having problems with it. Can anyone reproduce the buggy playground behaviour for the example below?
from_text format:csv 'a,b\n1,2'
I see. Maybe it should be possible to extend the formats used by from_text by importing formats from modules?
I find it rather confusing that from_text has no parenthesis, it makes me not know if its a language construct or if it is a function call. It seems from_text is defined in the std.prql file so it seems it is a function.