datafusion Convert indexed ARRAY_AGG to NTH

Is your feature request related to a problem or challenge?

Query below

SELECT a, ARRAY_AGG(c ORDER BY c)[1] as result
                        FROM multiple_ordered_table
                        GROUP BY a;

and

SELECT a, NTH_VALUE(c, 1 ORDER BY c) as result
                        FROM multiple_ordered_table
                        GROUP BY a;

produces same results. However, first query generates following plan

"ProjectionExec: expr=[a@0 as a, (ARRAY_AGG(multiple_ordered_table.c) ORDER BY [multiple_ordered_table.c ASC NULLS LAST]@1).[1] as result]",
"  AggregateExec: mode=Single, gby=[a@0 as a], aggr=[ARRAY_AGG(multiple_ordered_table.c)], ordering_mode=Sorted",
"    CsvExec: file_groups={1 group: [[CSV_PATH]]}, projection=[a, c], output_orderings=[[a@0 ASC NULLS LAST], [c@1 ASC NULLS LAST]], has_header=true",

whereas second query generates following plan

"ProjectionExec: expr=[a@0 as a, NTH_VALUE(multiple_ordered_table.c,Int64(1)) ORDER BY [multiple_ordered_table.c ASC NULLS LAST]@1 as result]",
"  AggregateExec: mode=Single, gby=[a@0 as a], aggr=[NTH_VALUE(multiple_ordered_table.c,Int64(1))], ordering_mode=Sorted",
"    CsvExec: file_groups={1 group: [[CSV_PATH]]}, projection=[a, c], output_orderings=[[a@0 ASC NULLS LAST], [c@1 ASC NULLS LAST]], has_header=true",

Describe the solution you'd like

we can rewrite first query as second one, which executes faster with less memory. Because it no longer needs to keep all results in the array_agg.

Describe alternatives you've considered

No response

Additional context

No response

Feb 13 '24 10:02 mustafasrepo

I can do this one

Feb 18 '24 03:02 Lordworms

Hello @mustafasrepo could you give me the actual table creation SQL for this issue?

Feb 19 '24 02:02 Lordworms

Hello @mustafasrepo could you give me the actual table creation SQL for this issue?

Sure,

CREATE EXTERNAL TABLE multiple_ordered_table (
  a0 INTEGER,
  a INTEGER,
  b INTEGER,
  c INTEGER,
  d INTEGER
)
STORED AS CSV
WITH HEADER ROW
WITH ORDER (a ASC, b ASC)
WITH ORDER (c ASC)
LOCATION '../core/tests/data/window_2.csv';

you can create the table in the queries with above snippet

Feb 19 '24 08:02 mustafasrepo

take

Feb 20 '24 22:02 Lordworms

As @Lordworms points out, maybe we can try to implement this feature in some way that is general and not special cased in the optimzer -- aka https://github.com/apache/arrow-datafusion/issues/9289

This would look like

Making ARRAY_AGG an AggregateUDF (which probably would mean making a datafusion-aggregates crate
Adding an API to do the rewrite

This would certainly take more work and thus more time than just implementing a special case for the BuiltInAggregateFunction, so I don't think it is necessary

However, if we think this is a reasonbale approach I can file some tickets with the basic ideas sketeched out (I didn't want to sketch out too many things at once and we already have a bunch of work related to pulling out scalar UDFs)

Feb 21 '24 07:02 alamb

Yes, I truely want to add some general approach instead of writing some "if else", but I agree that currently we have more ticket than that, I could do this one later and do some current-ticket...

Feb 21 '24 15:02 Lordworms

datafusion
datafusion copied to clipboard

Convert indexed ARRAY_AGG to NTH_VALUE

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

datafusion datafusion copied to clipboard

Convert indexed ARRAY_AGG to NTH_VALUE

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

datafusion
datafusion copied to clipboard