[FEA] make merge operation optional after Groupby
Issue by rnyak
Thursday May 21, 2020 at 17:35 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/163
Is your feature request related to a problem? Please describe.
Currently, after the GroupBy operation is applied, the grouped columns are merged with the original data frame (gdf) and we get a new_gdf (see the op_logic function below). Could we have some flexibility here? For example, we could add an option such as merge=True or inplace=True to the GroupBy operation: if True, the grouped features would be merged as they are now; if False, the user would get a separate dataframe containing only the grouped columns (cats and conts).
class GroupBy(DFOperator):
    """
    One of the ways to create new features is to calculate
    the basic statistics of the data that is grouped by a categorical
    feature. This operator groups the data by the given categorical
    feature(s) and calculates the std, variance, and sum of requested continuous
    features along with count of every group. Then, merges these new statistics
    with the data using the unique ids of categorical data.
    Although you can directly call methods of this class to
    transform your categorical features, it's typically used within a
    Workflow class.
    Parameters
    -----------
    ....

    def op_logic(self, gdf: cudf.DataFrame, target_columns: list, stats_context=None):
        if self.cat_names is None:
            raise ValueError("cat_names cannot be None.")
        new_gdf = cudf.DataFrame()
        for name in stats_context["moments"]:
            tran_gdf = stats_context["moments"][name].merge(gdf)
            new_gdf[tran_gdf.columns] = tran_gdf
        return new_gdf
Describe the solution you'd like
Here is an example of how we currently apply the GroupBy operation:
proc.add_cat_feature(GroupBy(cat_names=cat_names[0], cont_names=cols[0:2], stats=['count', 'sum']))
Could we add a merge param here, like below?
proc.add_cat_feature(GroupBy(cat_names=cat_names[0], cont_names=cols[0:2], stats=['count', 'sum']), merge=True)
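To make the two return shapes concrete, here is a minimal sketch of such a switch. It uses pandas as a CPU stand-in for cudf so it runs without a GPU. The helper `groupby_stats`, its `merge` parameter, and the toy column names are hypothetical; none of this is the existing GroupBy operator.

```python
import pandas as pd


def groupby_stats(df, cat_name, cont_names, stats, merge=True):
    """Hypothetical helper: per-group statistics, optionally merged back."""
    # One row per category value, one column per (continuous column, stat) pair.
    agg = df.groupby(cat_name)[cont_names].agg(stats)
    # Flatten the MultiIndex columns, e.g. ("x", "sum") -> "x_sum".
    agg.columns = [f"{col}_{stat}" for col, stat in agg.columns]
    agg = agg.reset_index()
    if merge:
        # Analogous to today's op_logic: attach the statistics to every original row.
        return df.merge(agg, on=cat_name, how="left")
    # Proposed alternative: return only the grouped columns.
    return agg


# Toy data, made up for illustration.
df = pd.DataFrame({"cat": ["a", "a", "b"], "x": [1.0, 2.0, 4.0], "y": [10.0, 20.0, 30.0]})
merged = groupby_stats(df, "cat", ["x", "y"], ["count", "sum"], merge=True)
stats_only = groupby_stats(df, "cat", ["x", "y"], ["count", "sum"], merge=False)
```

With `merge=True` the result keeps one row per input row; with `merge=False` it has one row per category value.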
One consequence is that, if merge=False, we need to return a separate df as the result of the GroupBy operation. It would be good to identify use cases where this is practically useful:
- we apply GroupBy and only want to use the grouped features.
- we apply GroupBy and only want the stats we obtain, perhaps to use them as a normalization factor for some other feature.
- we want to merge the result with the original gdf, but can we drop the original columns (if we want to)?
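The last two use cases might look roughly like this. It is a sketch with pandas standing in for cudf; the toy data, the column names, and the `x_norm` feature are assumptions made up for illustration.

```python
import pandas as pd

# Toy data, made up for illustration.
df = pd.DataFrame({"cat": ["a", "a", "b"], "x": [1.0, 3.0, 4.0]})

# Stats-only result, as merge=False would return it (one row per category).
stats = df.groupby("cat")["x"].agg(["count", "sum"]).reset_index()
stats = stats.rename(columns={"count": "x_count", "sum": "x_sum"})

# Use a group statistic as a normalization factor: each value's share of its group sum.
joined = df.merge(stats, on="cat", how="left")
joined["x_norm"] = joined["x"] / joined["x_sum"]

# Keep the merged statistics but drop the original continuous column.
stats_without_original = joined.drop(columns=["x"])
```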
@rnyak - The CategoryStatistics op is the StatOperator dependency for GroupBy. That operator will create a separate "groupby statistics" DataFrame for each categorical column targeted by your GroupBy op (and persist them to disk). Does using CategoryStatistics alone already provide the functionality you need?
@rjzamora I will test it and let you know. Thanks.