datafusion
datafusion copied to clipboard
Array agg groups accumulator, second attempt
Continuation of https://github.com/apache/datafusion/pull/10149
Which issue does this PR close?
Closes #10145. Closes https://github.com/apache/datafusion/pull/10149
Rationale for this change
See the issue.
What changes are included in this PR?
GroupsAccumulator for array_agg
aggregation function for:
- Primitive types
- String type
Not included:
- Accumulating arrays of any level of nesting.
Are these changes tested?
Extended tests in aggregate.slt
Are there any user-facing changes?
Yes, IGNORE NULLS
now works with array_agg.
I have not yet addressed the comments of the original PR:
- support for LargeUtf8 and Binary (https://github.com/apache/datafusion/pull/10149#discussion_r1573482631)
- benchmark
but instead want to first find out if my approach to null handling is generally valid or not.
The problem with null handling is that this implementation originally ignored null values.
However a test in aggregate.slt
added meanwhile asserts that it is included.
Now I asked in Discord what approach is correct, to which I was informed that DF is trying to follow PG and Spark.
I thus checked PG and Spark, and as it turns out PG includes nulls while Spark does not:
>>> from pyspark.sql.types import StringType
>>> from pyspark.sql.functions import array_agg
>>> data = [("value1",), (None,), ("value3",), (None,), ("value5",)]
>>> df = spark.createDataFrame(data, ["column1"], StringType())
>>> df.count()
5
>>> df.agg(array_agg('column1').alias('r')).collect()
[Row(r=['value1', 'value3', 'value5'])]
However:
-- create
CREATE TABLE EMPLOYEE (
empId INTEGER PRIMARY KEY,
name TEXT,
dept TEXT NOT NULL
);
-- insert
INSERT INTO EMPLOYEE VALUES (0001, 'Clark', 'Sales');
INSERT INTO EMPLOYEE VALUES (0002, 'Dave', 'Accounting');
INSERT INTO EMPLOYEE VALUES (0003, 'Ava', 'Sales');
INSERT INTO EMPLOYEE VALUES (0004, NULL, 'Sales');
-- fetch
SELECT ARRAY_AGG(name), dept FROM EMPLOYEE group by dept;
array_agg | dept
------------------+------------
{Dave} | Accounting
{Clark,Ava,NULL} | Sales
(2 rows)
So I decided to make everyone happy by adding support for IGNORE NULLS