NVTabular [FEA] make merge operation optional after Groupby

Issue by rnyak, Thursday May 21, 2020 at 17:35 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/163

Is your feature request related to a problem? Please describe.

Currently, after a GroupBy operation is applied, the grouped columns are merged with the original data frame (gdf) and we get a new_gdf (see the op_logic function below). Could we have some flexibility here? For example, the GroupBy operation could take a merge option (parameter) such as merge=True or inplace=True: if True, the grouped features would be merged as they are now; if False, the user would get a separate dataframe containing only the grouped columns (cats and conts).

class GroupBy(DFOperator):
    """
    One of the ways to create new features is to calculate
    the basic statistics of the data that is grouped by a categorical
    feature. This operator groups the data by the given categorical
    feature(s) and calculates the std, variance, and sum of requested continuous
    features along with count of every group. Then, merges these new statistics
    with the data using the unique ids of categorical data.
    Although you can directly call methods of this class to
    transform your categorical features, it's typically used within a
    Workflow class.
    Parameters
    -----------
   ....
    def op_logic(self, gdf: cudf.DataFrame, target_columns: list, stats_context=None):
        if self.cat_names is None:
            raise ValueError("cat_names cannot be None.")

        new_gdf = cudf.DataFrame()
        for name in stats_context["moments"]:
            tran_gdf = stats_context["moments"][name].merge(gdf)
            new_gdf[tran_gdf.columns] = tran_gdf

        return new_gdf

Describe the solution you'd like

This is an example of how we can apply the GroupBy operation:

proc.add_cat_feature(GroupBy(cat_names=cat_names[0], cont_names=cols[0:2], stats=['count', 'sum']))

Can we add a merge param here, like below?

proc.add_cat_feature(GroupBy(cat_names=cat_names[0], cont_names=cols[0:2], stats=['count', 'sum']), merge=True)
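A minimal sketch of how such a flag might behave, written with pandas as a CPU stand-in for cudf so that it runs without a GPU. The helper name groupby_stats and its signature are hypothetical and are not part of NVTabular; putting the flag on the helper itself is only one possible placement.

```python
import pandas as pd


def groupby_stats(df, cat_name, cont_names, stats, merge=True):
    """Hypothetical helper (not NVTabular API) illustrating a merge flag."""
    # One row per category; columns come back as (cont, stat) pairs.
    grouped = df.groupby(cat_name)[cont_names].agg(stats)
    # Flatten the two-level column index into names like "x_sum".
    grouped.columns = [f"{cont}_{stat}" for cont, stat in grouped.columns]
    grouped = grouped.reset_index()
    if merge:
        # Current behaviour: attach the statistics to every original row.
        return df.merge(grouped, on=cat_name, how="left")
    # Proposed behaviour: return only the per-category statistics.
    return grouped


df = pd.DataFrame(
    {"cat": ["a", "b", "a"], "x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]}
)
merged = groupby_stats(df, "cat", ["x", "y"], ["count", "sum"], merge=True)
stats_only = groupby_stats(df, "cat", ["x", "y"], ["count", "sum"], merge=False)
```

With merge=True the result keeps one row per original row; with merge=False it has one row per category and no original continuous columns.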

One consideration: if merge=False, we need to return a separate df as the result of the GroupBy operation. It would help to identify use-cases where this is practically useful:

We apply GroupBy and only want to use the grouped features.
We apply GroupBy and only want the resulting stats, perhaps to use them as a normalization factor for some other feature.
We want to merge the result with the original gdf, but be able to drop the original columns if we want to.
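The second and third use-cases can be illustrated roughly as follows. This is again a pandas sketch rather than cudf or NVTabular code, and the column names (cat, x) and variable names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"cat": ["a", "b", "a", "b"], "x": [1.0, 2.0, 3.0, 6.0]})

# Stats-only frame (what merge=False might return): one row per category.
stats = df.groupby("cat")["x"].agg(["count", "sum"]).add_prefix("x_").reset_index()

# Use a group statistic as a normalization factor for another feature:
# each value of x divided by the sum of x within its category.
with_stats = df.merge(stats, on="cat", how="left")
x_share = with_stats["x"] / with_stats["x_sum"]

# Merge with the original frame, then drop the original continuous column.
stats_per_row = with_stats.drop(columns=["x"])
```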

Jun 04 '20 23:06 benfred

@rnyak - The CategoryStatistics op is the StatOperator dependency for GroupBy. That operator will create a separate "groupby statistics" DataFrame for each categorical column targeted by your GroupBy op (and persist them to disk). Does using CategoryStatistics alone already provide the functionality you need?

Jul 16 '20 14:07 rjzamora
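To picture the behaviour described in the comment above (one statistics frame per categorical column, each written to disk), here is a rough pandas analogue. It does not call CategoryStatistics and says nothing about how NVTabular actually stores these frames; the CSV format, file names, and column names are assumptions chosen only to keep the sketch self-contained.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {"cat1": ["a", "b", "a"], "cat2": ["u", "u", "v"], "x": [1.0, 2.0, 3.0]}
)

out_dir = tempfile.mkdtemp()  # stand-in for a real output location
stats_frames = {}
for cat in ["cat1", "cat2"]:
    # One independent statistics frame per categorical column.
    frame = df.groupby(cat)["x"].agg(["count", "sum"]).reset_index()
    # Persist it (CSV is an assumption made for this sketch).
    frame.to_csv(os.path.join(out_dir, f"{cat}_stats.csv"), index=False)
    stats_frames[cat] = frame
```

Each entry in stats_frames can then be used on its own, without being merged back into df.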

@rjzamora I will test it and let you know. Thanks.

Jul 18 '20 18:07 rnyak