Introduce group_by
This has to work inside the DB:
```python
from datachain import Column, func

chain.group_by(
    name=func.first(Column("name")),
    total=func.sum(Column("num")),
    cnt=func.count(),
    partition_by=Column("class"),
)
```
Open question:
- Should it be a separate `group_by()` or a part of `agg()`? (see the hypothetical `agg()` sketch below)
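For reference, a hypothetical sketch of how the same aggregation might read if it were folded into `agg()`. This is not the existing API: today's `agg()` is geared toward running a Python callable per group, so accepting DB-side `func.*` expressions would be new behavior.

```python
from datachain import Column, func

# Hypothetical only: reusing agg() for DB-side aggregation. `chain` is any
# existing DataChain, as in the proposal snippet above.
chain.agg(
    name=func.first(Column("name")),
    total=func.sum(Column("num")),
    cnt=func.count(),
    partition_by=Column("class"),
)
```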
Functions to implement (a usage sketch follows the list):
- [x] count
- [x] sum
- [x] avg
- [x] min
- [x] max
- [ ] first
- [ ] collect / list
- [x] concat - concatenate strings
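A small usage sketch of the checked-off functions. Column names such as "file.size" and "file.path" are placeholders, and the exact `concat` signature is an assumption:

```python
from datachain import func

# Placeholder columns; assumes string column references work for all of these
# the same way they do for func.sum in the group_by.py example further below.
chain.group_by(
    cnt=func.count(),
    total_size=func.sum("file.size"),
    avg_size=func.avg("file.size"),
    min_size=func.min("file.size"),
    max_size=func.max("file.size"),
    paths=func.concat("file.path"),  # concatenate strings within each group
    partition_by="path_ext",
)
```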
Later:
- [ ] last
- [ ] std
- [ ] var
- [ ] min - for non-int/float should return None if there is a None in the list, otherwise any value (like `min([None, None, "Cat", "Dog", None]) --> None`; see the illustration after this list)
- [ ] max - for non-int/float should return any value but None (like `max([None, None, "Cat", "Dog", None]) --> "Cat"`)
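To pin down the min/max semantics above, a plain-Python illustration of the expected results (illustration only; the actual implementation would be a DB-side expression):

```python
def min_non_numeric(values):
    # Intended min semantics for non-numeric columns: if the group contains
    # a None, the result is None; otherwise any value may be returned.
    return None if any(v is None for v in values) else next(iter(values), None)


def max_non_numeric(values):
    # Intended max semantics for non-numeric columns: return any value that
    # is not None; only an all-None group yields None.
    return next((v for v in values if v is not None), None)


assert min_non_numeric([None, None, "Cat", "Dog", None]) is None
assert max_non_numeric([None, None, "Cat", "Dog", None]) == "Cat"
```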
Hi @dmpetrov. I think group_by should be implemented as a separate method, rather than as part of agg(). This approach would provide a clearer API and maintain consistency with the conventional usage in most data processing libraries (such as pandas and SQL).
@EdwardLi-coder agreed, it seems like a cleaner API. In general, I like the idea of separating DB/CPU compute from application/GPU compute, like `mutate()` and `map()`.
Intermediate results:
group_by.py:
```python
import os

from datachain import DataChain, func


def path_ext(path):
    # Return the file extension (without the leading dot) as a 1-tuple,
    # matching the single "path_ext" output signal.
    _, ext = os.path.splitext(path)
    return (ext.lstrip("."),)


(
    DataChain.from_storage("s3://dql-50k-laion-files/")
    .map(
        path_ext,
        params=["file.path"],
        output={"path_ext": str},
    )
    .group_by(
        total_size=func.sum("file.size"),
        cnt=func.count(),
        partition_by="path_ext",
    )
    .show()
)
```
Running:
```
~/playground $ python group_by.py
  path_ext  total_size    cnt
0      jpg  1079645149  43042
1     json    29743128  43047
2  parquet    15378208      5
3      txt     2927814  43042
~/playground $
```
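As a follow-up usage sketch, the grouped result could be persisted as a dataset by replacing `show()` with `save()`; the dataset name below is arbitrary:

```python
import os

from datachain import DataChain, func


def path_ext(path):
    _, ext = os.path.splitext(path)
    return (ext.lstrip("."),)


(
    DataChain.from_storage("s3://dql-50k-laion-files/")
    .map(path_ext, params=["file.path"], output={"path_ext": str})
    .group_by(
        total_size=func.sum("file.size"),
        cnt=func.count(),
        partition_by="path_ext",
    )
    .save("laion-files-by-ext")  # arbitrary dataset name
)
```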
TBD: clean up the code, add more aggregate functions, add tests, and create a PR. Draft PR: https://github.com/iterative/datachain/pull/482
Merged. Closing this issue; work will continue in the follow-up issue https://github.com/iterative/datachain/issues/523.