prql
prql copied to clipboard
Add sampling extension support to PRQL
What's up?
It would be great to handle table sampling in PRQL.
Examples:
I really like this idea, it could come in handy!
I wondered if it would be possible to approximate this using a PRQL function; that might allow you to try it out before it's implemented natively within PRQL. I put this function together which works! (sorta... see below)
let sample = func
nrows <float>
tbl <relation>
-> <relation> (
from s"SELECT * FROM {tbl} USING SAMPLE {nrows}"
)
from invoices
sample 20
select { customer_id }
This outputs the following SQL:
WITH table_0 AS (
SELECT
*
FROM
invoices USING SAMPLE 20
)
SELECT
customer_id
FROM
table_0
-- Generated by PRQL compiler version:0.11.3 (https://prql-lang.org)
However, I start to get strange errors if I modify the pipeline preceding the sample call. If I move the select { customer_id } before the sample 20, I get the error:
from invoices
select { customer_id }
sample 20
Error:
╭─[:9:1]
│
9 │ select { customer_id }
│ ───────────┬──────────
│ ╰──────────── unexpected ``(Select ...)``
│
│ Help: this is probably a 'bad type' error (we are working on that)
───╯
Perhaps I'm using the S-string wrong somehow... does anyone have suggestions?
@kgutwin great use of s-strings!
The one change to fix that error is:
let sample = func
- nrows <float>
+ nrows <int>
tbl <relation>
-> <relation> (
from s"SELECT * FROM {tbl} USING SAMPLE {nrows}"
)
I agree sample is useful. There are a lot of options for sample in DuckDB, which makes it a bit harder to add (though still tractable...). Possibly we could add it to the docs as a nice example of s-strings initially.
Would be very open to a PR for either!
We need to do a survey of supported features in these databases and come up with an interface that will allow (eventually) supporting all of those features.