df.groupby().agg() feature
Description
Added **kwargs Support in df.groupby().agg()
Implemented support for **kwargs in df.groupby().agg() to provide greater flexibility and customization options during aggregation operations.
This issue closes #15967
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [ ] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
This pull request requires additional validation before any workflows can run on NVIDIA's runners.
Pull request vetters can view their responsibilities here.
Contributors can view more details about this message here.
Thanks for the contribution @NeilGeorge1! Could you add some tests of some different cases? I know you won't be able to run them on your machine without a GPU, but you should be able to write them using pandas code locally and then simply switch to using cudf at the end before commiting. Then we can use your tests to validate the behavior here. I would recommend trying a few different more complex cases such as including multiple output columns.
Also, right now you've made the change to the dask object, but we want the functionality in cudf itself. Have a look at this file, which is where I think you'd want to add the new code.
Thanks for the contribution @NeilGeorge1! Could you add some tests of some different cases? I know you won't be able to run them on your machine without a GPU, but you should be able to write them using pandas code locally and then simply switch to using cudf at the end before commiting. Then we can use your tests to validate the behavior here. I would recommend trying a few different more complex cases such as including multiple output columns.
Also, right now you've made the change to the dask object, but we want the functionality in cudf itself. Have a look at this file, which is where I think you'd want to add the new code.
Yeah sure @vyasr. I don't have any prior experience writing test cases before but I shall try my best. Also should I just leave the code in dask as it is or should i remove it?
I think @galipremsagar or @wence- can probably give you a better answer about how best we want this to look in dask.
I think @galipremsagar or @wence- can probably give you a better answer about how best we want this to look in dask.
Let's first get the cudf implementation right first, it might then be we don't have to do anything dask-dataframe.
To do this, I think we want to mimic the way the pandas API behaves (https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.agg.html)
I don't think we support arguments and engine to agg in cudf, so we want to support this signature:
def agg(self, func, **kwargs):
The behaviour of pandas is:
- If
funcis notNone, thenkwargsare ignored (silently) - If
funcisNone, thenkwargsare used with the key defining the output column name, and the value defining the(column, aggregation)pair.
This preprocessing should happen in cudf/python/cudf/cudf/core/groupby/groupby.py in the agg method (around line 600).
I think the way to attack this is to let _normalize_aggs take both func and kwargs, and produce appropriately preprocessed results. There's a little bit of a dance we need to do to get the right names out, but it's certainly possible.
Understood @wence-
I will start by reading the documentation to grasp the required changes and proceed to implement them.
Following this, I will add few more test cases taking reference from the examples provided in the groupby.py file into python/cudf/cudf/tests/groupby/test_agg.py as requested by @vyasr .
Given that my PC lacks a GPU, I will then run these tests using only pandas and then when all test cases pass, switch to cudf before committing and making a pull request.
Is there anything else I need to consider?
Understood @wence-
I will start by reading the documentation to grasp the required changes and proceed to implement them.
Following this, I will add few more test cases taking reference from the examples provided in the
groupby.pyfile intopython/cudf/cudf/tests/groupby/test_agg.pyas requested by @vyasr .Given that my PC lacks a GPU, I will then run these tests using only
pandasand then when all test cases pass, switch tocudfbefore committing and making a pull request.Is there anything else I need to consider?
That sounds about right. If you have queries about the (admittedly underdocumented) internals of the groupby datastructures, please ask here.
Understood @wence- I will start by reading the documentation to grasp the required changes and proceed to implement them. Following this, I will add few more test cases taking reference from the examples provided in the
groupby.pyfile intopython/cudf/cudf/tests/groupby/test_agg.pyas requested by @vyasr . Given that my PC lacks a GPU, I will then run these tests using onlypandasand then when all test cases pass, switch tocudfbefore committing and making a pull request. Is there anything else I need to consider?That sounds about right. If you have queries about the (admittedly underdocumented) internals of the groupby datastructures, please ask here.
Sure thing! Will do thanks!
@NeilGeorge1 have you had any luck here? Do you need any pointers?
I'll open a new PR for this. xref #16528