[FEAT] Aggregations on List Types

Open samster25 opened this issue 11 months ago • 3 comments

We should support the following aggregations in the list type namespace, for example:

col('x').list.sum()
  • [x] Sum
  • [x] Mean
  • [x] Min
  • [x] Max
  • [ ] Distinct (List of unique elements)
  • [ ] Count_distinct (number of unique elements)
  • [ ] Flatten (Flattens List[List[T]] to List[T])
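For reference, the intended per-row semantics of the aggregations listed above can be sketched in plain Python. This is only an illustration, not Daft's implementation: the helper names (`list_sum`, `list_distinct`, and so on) are hypothetical, and the handling of nulls and empty lists (skip nulls, return `None` for an empty list) is an assumption that the issue does not specify.

```python
from itertools import chain


def _non_null(values):
    # Assumption: nulls (None) are ignored by every aggregation.
    return [v for v in values if v is not None]


def list_sum(values):
    vals = _non_null(values)
    return sum(vals) if vals else None


def list_mean(values):
    vals = _non_null(values)
    return sum(vals) / len(vals) if vals else None


def list_min(values):
    vals = _non_null(values)
    return min(vals) if vals else None


def list_max(values):
    vals = _non_null(values)
    return max(vals) if vals else None


def list_distinct(values):
    # Unique elements, keeping first-seen order (ordering is an assumption).
    return list(dict.fromkeys(_non_null(values)))


def list_count_distinct(values):
    # Number of unique non-null elements.
    return len(set(_non_null(values)))


def list_flatten(nested):
    # List[List[T]] -> List[T]; null inner lists are skipped (assumption).
    return list(chain.from_iterable(inner for inner in nested if inner is not None))
```

Each helper operates on the list held in a single row, so applying one to a list column would produce one output value (or one output list) per row.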

samster25, Mar 06 '24 06:03

@nsalerni Can you let me know if I missed anything?

samster25, Mar 06 '24 06:03

@samster25 This covers a good chunk of the use case. Two others I can think of:

  1. The one I can see missing from the above list is apply() (i.e. being able to apply some form of custom logic to a list column). It seems like that's covered by https://github.com/Eventual-Inc/Daft/issues/1976?

  2. I'm not sure if the above would implicitly allow us to support the following, but this would be another simplified example of a use case I'd like to support:

import daft
from daft import col

df = daft.from_pydict({
    "strings": ["a", "b", "c", "d"],
    "lists": [[1, 1, 1, 1], [1, 1, 1, 1], [2, 2, 2], [2, 2, 2]],
})

# Group by the list column itself and count the rows in each group
df.groupby('lists').agg([
    (col("lists").alias("list_count"), 'count')
]).collect()

I'd imagine the output of this looking something like:

lists (Int64) | list_count (UInt64)
------------- | -----------------
[2, 2, 2]     |      2
[1, 1, 1, 1]  |      2

Today this yields:

PanicException: List(Int64) not implemented
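The grouping I have in mind can be sketched in plain Python as follows. This is a minimal illustration of the expected result, not how Daft does or would implement it: the function name `count_by_list` is hypothetical, and converting each list to a tuple is just one way to get a hashable grouping key.

```python
from collections import Counter


def count_by_list(rows):
    # Lists are unhashable, so use a tuple of each list as the group key.
    counts = Counter(tuple(row) for row in rows)
    # Convert keys back to lists for display alongside their counts.
    return [(list(key), n) for key, n in counts.items()]


lists = [[1, 1, 1, 1], [1, 1, 1, 1], [2, 2, 2], [2, 2, 2]]
result = count_by_list(lists)
```

Two lists fall into the same group only if they have the same elements in the same order, which matches the table above.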

nsalerni, Mar 06 '24 17:03

Hi @nsalerni! I just made a new issue to track the work on grouping by list columns: https://github.com/Eventual-Inc/Daft/issues/1983

kevinzwang, Mar 06 '24 19:03