Suggest adding syntax `select * from generate_series()`
Is your feature request related to a problem or challenge?
It is complex and slow to use unnest().
Describe the solution you'd like
DataFusion should be able to run `select ... from generate_series()` as DuckDB does.
Describe alternatives you've considered
No response
Additional context
D:\duckdb>datafusion-cli
DataFusion CLI v37.0.0
> select sum(a) from unnest(generate_series(1,1000000)) as t(a);
+--------------+
| SUM(t.a)    |
+--------------+
| 500000500000 |
+--------------+
1 row(s) fetched.
Elapsed 3.648 seconds.
> select sum(a) from generate_series(1,1000000) as t(a);
Error during planning: table function 'generate_series' not found
>
D:\duckdb>duckdb011
v0.10.1 4a89d97db8
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D .timer on
D select sum(a) from unnest(generate_series(1,1000000)) as t(a);
┌──────────────┐
│ sum(a) │
│ int128 │
├──────────────┤
│ 500000500000 │
└──────────────┘
Run Time (s): real 0.854 user 0.859375 sys 0.078125
D select sum(a) from generate_series(1,1000000) as t(a);
┌──────────────┐
│ sum(a) │
│ int128 │
├──────────────┤
│ 500000500000 │
└──────────────┘
Run Time (s): real 0.033 user 0.015625 sys 0.000000
I think we could add this as a user defined table function in datafusion-cli quite easily
For example we could follow the model of parquet_metadata (which also follows duckdb):
SELECT path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size
FROM parquet_metadata('hits.parquet')
WHERE path_in_schema = '"WatchID"'
LIMIT 3;
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| path_in_schema | row_group_id | row_group_num_rows | stats_min | stats_max | total_compressed_size |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| "WatchID" | 0 | 450560 | 4611687214012840539 | 9223369186199968220 | 3883759 |
| "WatchID" | 1 | 612174 | 4611689135232456464 | 9223371478009085789 | 5176803 |
| "WatchID" | 2 | 344064 | 4611692774829951781 | 9223363791697310021 | 3031680 |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
3 rows in set. Query took 0.053 seconds.
It is documented here https://arrow.apache.org/datafusion/user-guide/cli.html
The code for it is here: https://github.com/apache/arrow-datafusion/blob/637293580db0634a4efbd3f52e4700992ee3080d/datafusion-cli/src/functions.rs#L215-L442
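Following that model, here is a rough sketch of what a generate_series table function could look like. This is a hedged illustration, not the actual datafusion-cli code: it assumes the same TableFunctionImpl trait and SessionContext::register_udtf API that parquet_metadata uses, handles only two Int64 literal arguments, and materializes the whole series into a MemTable rather than streaming batches.

```rust
// Minimal sketch (not the real datafusion-cli code) of a generate_series UDTF
// modeled on parquet_metadata. Only generate_series(<Int64 literal>, <Int64
// literal>) is handled, and the whole series is materialized in memory.
use std::sync::Arc;

use datafusion::arrow::array::Int64Array;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::function::TableFunctionImpl;
use datafusion::datasource::{MemTable, TableProvider};
use datafusion::error::{DataFusionError, Result};
use datafusion::prelude::*;
use datafusion::scalar::ScalarValue;

#[derive(Debug)]
struct GenerateSeriesFunc;

impl TableFunctionImpl for GenerateSeriesFunc {
    fn call(&self, args: &[Expr]) -> Result<Arc<dyn TableProvider>> {
        // Extract an Int64 literal bound from an argument expression
        let bound = |e: &Expr| match e {
            Expr::Literal(ScalarValue::Int64(Some(v))) => Ok(*v),
            _ => Err(DataFusionError::Plan(
                "generate_series expects Int64 literal arguments".to_string(),
            )),
        };
        let (start, end) = match args {
            [s, e] => (bound(s)?, bound(e)?),
            _ => {
                return Err(DataFusionError::Plan(
                    "generate_series expects exactly two arguments".to_string(),
                ))
            }
        };

        // Build a single in-memory batch holding the series
        let schema = Arc::new(Schema::new(vec![Field::new(
            "value",
            DataType::Int64,
            false,
        )]));
        let array = Int64Array::from_iter_values(start..=end);
        let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(array)])?;
        Ok(Arc::new(MemTable::try_new(schema, vec![vec![batch]])?))
    }
}

// Registration would happen wherever datafusion-cli sets up its SessionContext:
// ctx.register_udtf("generate_series", Arc::new(GenerateSeriesFunc));
```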
related https://github.com/apache/arrow-datafusion/issues/9354
I think switching generate_series to a UDTF will break the API. In most cases I think that is an acceptable API change. However, there is (at least for me) one particular use case, the array generated by generate_series, that will probably need closer attention.
Example
I need to spread the data equally over equal intervals between 2 data points. I use `generate_series` to generate an array per row. The way that I do this now is as follows:
create table manual(time timestamp, index_value DOUBLE);
insert into
manual (time, index_value)
values
('2022-01-31 23:00:00.0000000', 935035.000),
('2022-02-28 23:00:00.0000000', 954469.000),
('2022-03-31 22:00:00.0000000', 905204.000),
('2022-04-30 22:00:00.0000000', 971415.000);
with
index_values AS (
select
time AS to_time,
lag(time, 1) over (order by time) AS from_time,
index_value AS to_index,
lag(index_value, 1) over (order by time) AS from_index
from manual
),
series AS (
select
unnest(generate_series(from_time, to_time, interval '1 month')) AS time,
(case
when to_index < from_index THEN (999999 - from_index) + to_index
else to_index - from_index end)
/ coalesce(nullif(array_length(generate_series(from_time, to_time, interval '1 month')), 0), 1) AS delta_value
from index_values
) SELECT * FROM series;
Specifically, array_length(generate_series(from_time, to_time, interval '1 month')) will need to be done differently if generate_series becomes a UDTF. It will probably need to become COUNT(time) OVER (partition by to_time, from_time):
with
index_values AS (
select
time AS to_time,
lag(time, 1) over (order by time) AS from_time,
index_value AS to_index,
lag(index_value, 1) over (order by time) AS from_index
from manual
),
series AS (
select
time,
(case
when to_index < from_index THEN (999999 - from_index) + to_index
else to_index - from_index end)
/ coalesce(nullif(count(time) over (partition by to_time, from_time), 0), 1) AS delta_value
from index_values, generate_series(from_time, to_time, interval '900 seconds') AS time
) SELECT * FROM series;
(If you have other/better suggestions for how data can be spread/interpolated in raw DF, all ears :) )
TL;DR: If you're using generate_series to generate a series on a per-row basis, you may need to consider OVER (PARTITION BY your_row_composite_key) if this API change takes place.
Just mentioning this here in case anyone is using generate_series in a similar way as I do, and because I want to know if this is ever added :p so that I know I need to update my queries.
FWIW I think we should consider keeping the existing generate_series function and adding a new user defined table function with the same name.
That might work, but I wonder how it will be inferred and whether it will cause any conflicts. Some examples, off the top of my head, that can be ambiguous:
select generate_series(1,3), * from generate_series(10, 13)
This can be either (postgresql behavior):
| generate_series | generate_series |
|---|---|
| 1 | 10 |
| 2 | 10 |
| 3 | 10 |
| 1 | 11 |
| 2 | 11 |
| 3 | 11 |
| 1 | 12 |
| 2 | 12 |
| 3 | 12 |
| 1 | 13 |
| 2 | 13 |
| 3 | 13 |
or
+------------------------------------+----------------------------------------------+
| generate_series(Int64(1),Int64(3)) | UNNEST(generate_series(Int64(10),Int64(13))) |
+------------------------------------+----------------------------------------------+
| [1, 2, 3] | 10 |
| [1, 2, 3] | 11 |
| [1, 2, 3] | 12 |
| [1, 2, 3] | 13 |
+------------------------------------+----------------------------------------------+
4 row(s) fetched.
Elapsed 0.001 seconds.
Also, I wonder what the results of something like this would look like 🤔:
select generate_series(1, 3), array_length(generate_series(1, 3))
Maybe we can use the 2nd interpretation (the same as DuckDB); currently, DataFusion's generate_series expression already differs from Postgres's semantics.
I think we could add this as a user defined table function in datafusion-cli quite easily
@alamb Is it possible to implement this UDTF inside the datafusion crate? 🤔 I find this table function could make end-to-end tests for spilling-related queries much easier.
Makes sense to me -- however, we may need to create a datafusion-functions-table sub-crate as a place to put it 🤔
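If it does land in the datafusion crate, a spilling-related end-to-end test could look roughly like the sketch below. This is hypothetical: it assumes the table function is registered on SessionContext by default and that its output column is named value, neither of which is settled here.

```rust
// Hypothetical sketch of an end-to-end test inside the datafusion crate that
// uses a built-in generate_series table function to produce a large input
// without any external data files (handy for exercising spilling code paths).
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::test]
async fn aggregate_over_generated_series() -> Result<()> {
    let ctx = SessionContext::new();
    // Assumes generate_series is registered by default and emits a column "value"
    let df = ctx
        .sql("SELECT sum(value) FROM generate_series(1, 1000000)")
        .await?;
    df.show().await?;
    Ok(())
}
```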
BTW you can get a similar effect with unnest:
> select unnest(generate_series(1, 10));
+---------------------------------------------+
| UNNEST(generate_series(Int64(1),Int64(10))) |
+---------------------------------------------+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
| 10 |
+---------------------------------------------+
10 row(s) fetched.
Elapsed 0.020 seconds.