cudf icon indicating copy to clipboard operation
cudf copied to clipboard

df.groupby().agg() feature

Open NeilGeorge1 opened this issue 1 year ago • 8 comments

Description

Added **kwargs Support in df.groupby().agg()

Implemented support for **kwargs in df.groupby().agg() to provide greater flexibility and customization options during aggregation operations.

This issue closes #15967

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [ ] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

NeilGeorge1 avatar Jun 25 '24 18:06 NeilGeorge1

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Jun 25 '24 18:06 copy-pr-bot[bot]

Thanks for the contribution @NeilGeorge1! Could you add some tests of some different cases? I know you won't be able to run them on your machine without a GPU, but you should be able to write them using pandas code locally and then simply switch to using cudf at the end before commiting. Then we can use your tests to validate the behavior here. I would recommend trying a few different more complex cases such as including multiple output columns.

Also, right now you've made the change to the dask object, but we want the functionality in cudf itself. Have a look at this file, which is where I think you'd want to add the new code.

vyasr avatar Jun 25 '24 18:06 vyasr

Thanks for the contribution @NeilGeorge1! Could you add some tests of some different cases? I know you won't be able to run them on your machine without a GPU, but you should be able to write them using pandas code locally and then simply switch to using cudf at the end before commiting. Then we can use your tests to validate the behavior here. I would recommend trying a few different more complex cases such as including multiple output columns.

Also, right now you've made the change to the dask object, but we want the functionality in cudf itself. Have a look at this file, which is where I think you'd want to add the new code.

Yeah sure @vyasr. I don't have any prior experience writing test cases before but I shall try my best. Also should I just leave the code in dask as it is or should i remove it?

NeilGeorge1 avatar Jun 25 '24 19:06 NeilGeorge1

I think @galipremsagar or @wence- can probably give you a better answer about how best we want this to look in dask.

vyasr avatar Jun 25 '24 20:06 vyasr

I think @galipremsagar or @wence- can probably give you a better answer about how best we want this to look in dask.

Let's first get the cudf implementation right first, it might then be we don't have to do anything dask-dataframe.

To do this, I think we want to mimic the way the pandas API behaves (https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.agg.html)

I don't think we support arguments and engine to agg in cudf, so we want to support this signature:

def agg(self, func, **kwargs):

The behaviour of pandas is:

  • If func is not None, then kwargs are ignored (silently)
  • If func is None, then kwargs are used with the key defining the output column name, and the value defining the (column, aggregation) pair.

This preprocessing should happen in cudf/python/cudf/cudf/core/groupby/groupby.py in the agg method (around line 600).

I think the way to attack this is to let _normalize_aggs take both func and kwargs, and produce appropriately preprocessed results. There's a little bit of a dance we need to do to get the right names out, but it's certainly possible.

wence- avatar Jun 26 '24 09:06 wence-

Understood @wence-

I will start by reading the documentation to grasp the required changes and proceed to implement them.

Following this, I will add few more test cases taking reference from the examples provided in the groupby.py file into python/cudf/cudf/tests/groupby/test_agg.py as requested by @vyasr .

Given that my PC lacks a GPU, I will then run these tests using only pandas and then when all test cases pass, switch to cudf before committing and making a pull request.

Is there anything else I need to consider?

NeilGeorge1 avatar Jun 26 '24 11:06 NeilGeorge1

Understood @wence-

I will start by reading the documentation to grasp the required changes and proceed to implement them.

Following this, I will add few more test cases taking reference from the examples provided in the groupby.py file into python/cudf/cudf/tests/groupby/test_agg.py as requested by @vyasr .

Given that my PC lacks a GPU, I will then run these tests using only pandas and then when all test cases pass, switch to cudf before committing and making a pull request.

Is there anything else I need to consider?

That sounds about right. If you have queries about the (admittedly underdocumented) internals of the groupby datastructures, please ask here.

wence- avatar Jun 26 '24 13:06 wence-

Understood @wence- I will start by reading the documentation to grasp the required changes and proceed to implement them. Following this, I will add few more test cases taking reference from the examples provided in the groupby.py file into python/cudf/cudf/tests/groupby/test_agg.py as requested by @vyasr . Given that my PC lacks a GPU, I will then run these tests using only pandas and then when all test cases pass, switch to cudf before committing and making a pull request. Is there anything else I need to consider?

That sounds about right. If you have queries about the (admittedly underdocumented) internals of the groupby datastructures, please ask here.

Sure thing! Will do thanks!

NeilGeorge1 avatar Jun 26 '24 14:06 NeilGeorge1

@NeilGeorge1 have you had any luck here? Do you need any pointers?

vyasr avatar Jul 19 '24 17:07 vyasr

I'll open a new PR for this. xref #16528

Matt711 avatar Aug 10 '24 01:08 Matt711