Suggest adding syntax `select * from generate_series()`
Is your feature request related to a problem or challenge?
It is complex and slow to use unnest().
Describe the solution you'd like
DataFusion should be able to run `select ... from generate_series()` as DuckDB does.
Describe alternatives you've considered
No response
Additional context
D:\duckdb>datafusion-cli
DataFusion CLI v37.0.0
> select sum(a) from unnest(generate_series(1,1000000)) as t(a);
+--------------+
| SUM(t.a)    |
+--------------+
| 500000500000 |
+--------------+
1 row(s) fetched.
Elapsed 3.648 seconds.
> select sum(a) from generate_series(1,1000000) as t(a);
Error during planning: table function 'generate_series' not found
>
D:\duckdb>duckdb011
v0.10.1 4a89d97db8
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D .timer on
D select sum(a) from unnest(generate_series(1,1000000)) as t(a);
┌──────────────┐
│ sum(a) │
│ int128 │
├──────────────┤
│ 500000500000 │
└──────────────┘
Run Time (s): real 0.854 user 0.859375 sys 0.078125
D select sum(a) from generate_series(1,1000000) as t(a);
┌──────────────┐
│ sum(a) │
│ int128 │
├──────────────┤
│ 500000500000 │
└──────────────┘
Run Time (s): real 0.033 user 0.015625 sys 0.000000
I think we could add this as a user defined table function in datafusion-cli quite easily
For example we could follow the model of parquet_metadata (which also follows duckdb):
SELECT path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size
FROM parquet_metadata('hits.parquet')
WHERE path_in_schema = '"WatchID"'
LIMIT 3;
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| path_in_schema | row_group_id | row_group_num_rows | stats_min | stats_max | total_compressed_size |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| "WatchID" | 0 | 450560 | 4611687214012840539 | 9223369186199968220 | 3883759 |
| "WatchID" | 1 | 612174 | 4611689135232456464 | 9223371478009085789 | 5176803 |
| "WatchID" | 2 | 344064 | 4611692774829951781 | 9223363791697310021 | 3031680 |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
3 rows in set. Query took 0.053 seconds.
It is documented here https://arrow.apache.org/datafusion/user-guide/cli.html
The code for it is here: https://github.com/apache/arrow-datafusion/blob/637293580db0634a4efbd3f52e4700992ee3080d/datafusion-cli/src/functions.rs#L215-L442
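Following that model, here is a rough sketch of what a generate_series table function could look like. This is a hedged illustration, not the actual datafusion-cli code: it assumes the same TableFunctionImpl trait and SessionContext::register_udtf API that parquet_metadata uses, handles only two Int64 literal arguments, and materializes the whole series into a MemTable rather than streaming batches.

```rust
// Minimal sketch (not the real datafusion-cli code) of a generate_series UDTF
// modeled on parquet_metadata. Only generate_series(<Int64 literal>, <Int64
// literal>) is handled, and the whole series is materialized in memory.
use std::sync::Arc;

use datafusion::arrow::array::Int64Array;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::function::TableFunctionImpl;
use datafusion::datasource::{MemTable, TableProvider};
use datafusion::error::{DataFusionError, Result};
use datafusion::prelude::*;
use datafusion::scalar::ScalarValue;

#[derive(Debug)]
struct GenerateSeriesFunc;

impl TableFunctionImpl for GenerateSeriesFunc {
    fn call(&self, args: &[Expr]) -> Result<Arc<dyn TableProvider>> {
        // Extract an Int64 literal bound from an argument expression
        let bound = |e: &Expr| match e {
            Expr::Literal(ScalarValue::Int64(Some(v))) => Ok(*v),
            _ => Err(DataFusionError::Plan(
                "generate_series expects Int64 literal arguments".to_string(),
            )),
        };
        let (start, end) = match args {
            [s, e] => (bound(s)?, bound(e)?),
            _ => {
                return Err(DataFusionError::Plan(
                    "generate_series expects exactly two arguments".to_string(),
                ))
            }
        };

        // Build a single in-memory batch holding the series
        let schema = Arc::new(Schema::new(vec![Field::new(
            "value",
            DataType::Int64,
            false,
        )]));
        let array = Int64Array::from_iter_values(start..=end);
        let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(array)])?;
        Ok(Arc::new(MemTable::try_new(schema, vec![vec![batch]])?))
    }
}

// Registration would happen wherever datafusion-cli sets up its SessionContext:
// ctx.register_udtf("generate_series", Arc::new(GenerateSeriesFunc));
```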
related https://github.com/apache/arrow-datafusion/issues/9354
I think switching generate_series to a UDTF will break the API. In most cases I think that is an acceptable API change. However, there is (at least for me) one particular use case, the array generated by generate_series, that will probably need closer attention.
Example
I need to spread the data equally over equal intervals between 2 data points. I use `generate_series` to generate an array per row. The way that I do this now is as follows:
create table manual(time timestamp, index_value DOUBLE);
insert into
manual (time, index_value)
values
('2022-01-31 23:00:00.0000000', 935035.000),
('2022-02-28 23:00:00.0000000', 954469.000),
('2022-03-31 22:00:00.0000000', 905204.000),
('2022-04-30 22:00:00.0000000', 971415.000);
with
index_values AS (
select
time AS to_time,
lag(time, 1) over (order by time) AS from_time,
index_value AS to_index,
lag(index_value, 1) over (order by time) AS from_index
from manual
),
series AS (
select
unnest(generate_series(from_time, to_time, interval '1 month')) AS time,
(case
when to_index < from_index THEN (999999 - from_index) + to_index
else to_index - from_index end)
/ coalesce(nullif(array_length(generate_series(from_time, to_time, interval '1 month')), 0), 1) AS delta_value
from index_values
) SELECT * FROM series;
Specifically, array_length(generate_series(from_time, to_time, interval '1 month')) will need to be done differently if generate_series becomes a UDTF. It will probably need to become COUNT(time) OVER (partition by to_time, from_time):
with
index_values AS (
select
time AS to_time,
lag(time, 1) over (order by time) AS from_time,
index_value AS to_index,
lag(index_value, 1) over (order by time) AS from_index
from manual
),
series AS (
select
time,
(case
when to_index < from_index THEN (999999 - from_index) + to_index
else to_index - from_index end)
/ coalesce(nullif(count(time) over (partition by to_time, from_time), 0), 1) AS delta_value
from index_values, generate_series(from_time, to_time, interval '900 seconds') AS time
) SELECT * FROM series;
(If you have other/better suggestions for how data can be spread/interpolated in raw DF, all ears :) )
TL;DR: If you're using generate_series to generate a series on a per-row basis, you may need to consider OVER (PARTITION BY your_row_composite_key) if this API change takes place.
Just mentioning this here in case anyone is using generate_series in a similar way as I do, and because I want to know if this is ever added :p so that I know I need to update my queries.
FWIW I think we should consider keeping the existing generate_series function and adding a new user defined table function with the same name.
That might work, but I wonder how it will be inferred and whether it will cause any conflicts. Some examples, off the top of my head, that can be ambiguous:
select generate_series(1,3), * from generate_series(10, 13)
This can be either (postgresql behavior):
| generate_series | generate_series |
|---|---|
| 1 | 10 |
| 2 | 10 |
| 3 | 10 |
| 1 | 11 |
| 2 | 11 |
| 3 | 11 |
| 1 | 12 |
| 2 | 12 |
| 3 | 12 |
| 1 | 13 |
| 2 | 13 |
| 3 | 13 |
or
+------------------------------------+----------------------------------------------+
| generate_series(Int64(1),Int64(3)) | UNNEST(generate_series(Int64(10),Int64(13))) |
+------------------------------------+----------------------------------------------+
| [1, 2, 3] | 10 |
| [1, 2, 3] | 11 |
| [1, 2, 3] | 12 |
| [1, 2, 3] | 13 |
+------------------------------------+----------------------------------------------+
4 row(s) fetched.
Elapsed 0.001 seconds.
Also, I wonder what the results of something like this would look like 🤔:
select generate_series(1, 3), array_length(generate_series(1, 3))
Maybe we can use the 2nd interpretation (the same as DuckDB); currently, DataFusion's generate_series expression already differs from Postgres's semantics.
I think we could add this as a user defined table function in datafusion-cli quite easily
@alamb Is it possible to implement this UDTF inside the datafusion crate? 🤔 I find this table function could make end-to-end tests for spilling-related queries much easier.
Makes sense to me -- however, we may need to create a datafusion-functions-table sub-crate as a place to put it 🤔
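If it does land in the datafusion crate, a spilling-related end-to-end test could look roughly like the sketch below. This is hypothetical: it assumes the table function is registered on SessionContext by default and that its output column is named value, neither of which is settled here.

```rust
// Hypothetical sketch of an end-to-end test inside the datafusion crate that
// uses a built-in generate_series table function to produce a large input
// without any external data files (handy for exercising spilling code paths).
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::test]
async fn aggregate_over_generated_series() -> Result<()> {
    let ctx = SessionContext::new();
    // Assumes generate_series is registered by default and emits a column "value"
    let df = ctx
        .sql("SELECT sum(value) FROM generate_series(1, 1000000)")
        .await?;
    df.show().await?;
    Ok(())
}
```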
BTW you can get a similar effect with unnest:
> select unnest(generate_series(1, 10));
+---------------------------------------------+
| UNNEST(generate_series(Int64(1),Int64(10))) |
+---------------------------------------------+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
| 10 |
+---------------------------------------------+
10 row(s) fetched.
Elapsed 0.020 seconds.