datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Array agg groups accumulator, second attempt

Open markusa380 opened this issue 8 months ago • 5 comments

Continuation of https://github.com/apache/datafusion/pull/10149

Which issue does this PR close?

Closes #10145. Closes https://github.com/apache/datafusion/pull/10149

Rationale for this change

See the issue.

What changes are included in this PR?

GroupsAccumulator for array_agg aggregation function for:

  • Primitive types
  • String type

Not included:

  • Accumulating arrays of any level of nesting.

Are these changes tested?

Extended tests in aggregate.slt

Are there any user-facing changes?

Yes, IGNORE NULLS now works with array_agg.


I have not yet addressed the comments of the original PR:

  • support for LargeUtf8 and Binary (https://github.com/apache/datafusion/pull/10149#discussion_r1573482631)
  • benchmark

but instead want to first find out if my approach to null handling is generally valid or not.

The problem with null handling is that this implementation originally ignored null values. However a test in aggregate.slt added meanwhile asserts that it is included. Now I asked in Discord what approach is correct, to which I was informed that DF is trying to follow PG and Spark. I thus checked PG and Spark, and as it turns out PG includes nulls while Spark does not:

>>> from pyspark.sql.types import StringType
>>> from pyspark.sql.functions import array_agg
>>> data = [("value1",), (None,), ("value3",), (None,), ("value5",)]
>>> df = spark.createDataFrame(data, ["column1"], StringType())
>>> df.count()
5
>>> df.agg(array_agg('column1').alias('r')).collect()
[Row(r=['value1', 'value3', 'value5'])]

However:

-- create
CREATE TABLE EMPLOYEE (
  empId INTEGER PRIMARY KEY,
  name TEXT,
  dept TEXT NOT NULL
);

-- insert
INSERT INTO EMPLOYEE VALUES (0001, 'Clark', 'Sales');
INSERT INTO EMPLOYEE VALUES (0002, 'Dave', 'Accounting');
INSERT INTO EMPLOYEE VALUES (0003, 'Ava', 'Sales');
INSERT INTO EMPLOYEE VALUES (0004, NULL, 'Sales');

-- fetch 
SELECT ARRAY_AGG(name), dept FROM EMPLOYEE group by dept;

    array_agg     |    dept
------------------+------------
 {Dave}           | Accounting
 {Clark,Ava,NULL} | Sales
(2 rows)

So I decided to make everyone happy by adding support for IGNORE NULLS

markusa380 avatar Jun 24 '24 10:06 markusa380