feature_engine
feat: Group Transformer
Is your feature request related to a problem? Please describe. Aggregating variables by a single category or by multiple categories is a simple task:
df.groupby('cat')["num_var"].transform("mean")
However, to make it compatible with a sklearn Pipeline, a Transformer that meets some requirements is needed. Feature-engine could provide a transformer with this functionality.
Describe the solution you'd like
A GroupTransform class with a fit() method that computes and stores the aggregated information, and a transform() method that merges this information onto the input dataframe.
class GroupTransform:
    def fit(self, X, y=None):
        # compute and store the aggregated information
        self.X_agg_ = X.groupby('cat')["num_var"].transform("mean")
        return self

    def transform(self, X, y=None):
        # merge the stored information onto the input dataframe
        return X.merge(self.X_agg_)
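For illustration, here is a minimal runnable sketch of that idea, assuming the hard-coded group key 'cat' and variable 'num_var' from the snippet above (both placeholder names), and using groupby().mean() rather than transform so the result can be merged back on the group key:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class GroupTransform(BaseEstimator, TransformerMixin):
    """Sketch: learn per-group means in fit, attach them in transform."""

    def fit(self, X, y=None):
        # one row per group, keyed by 'cat', so it can be merged later
        self.X_agg_ = (
            X.groupby("cat")["num_var"].mean().rename("num_var_mean").reset_index()
        )
        return self

    def transform(self, X):
        # left merge keeps every input row; unseen groups get NaN
        return X.merge(self.X_agg_, on="cat", how="left")

df = pd.DataFrame({"cat": ["a", "a", "b"], "num_var": [1.0, 3.0, 5.0]})
GroupTransform().fit_transform(df)  # adds num_var_mean: 2.0, 2.0, 5.0
```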
Describe alternatives you've considered
To make it 100% compatible with sklearn, the fit and transform methods should take an ndarray as input. To pass the information to a new dataframe, the merge method is needed.
Hey @TremaMiguel thanks for filing this issue.
A few things to consider:
- will this transformer allow grouping by any variable, or only by categorical variables? pros and cons? This refers to the variable 'cat' in the code snippet.
- the variables that will be used to create the aggregations ('num_var'): will they be only numerical, or also categorical? Functions like 'count' could be used over categorical variables.
- which functions is this transformer going to consider to derive features? mean, max... can we provide a full list? That may help decide my previous questions.
- also, should we use pandas agg instead of transform? (see the snippet after this list)
- very important: the transformer should learn the mappings in fit, and then use the same mappings in transform. I guess pandas merge will take care of this? (also illustrated below)
- it should allow the transformation of multiple variables at the same time... we need to think about how to enter the variables to group by and the variables to derive the calculations from.
- Feature-engine is designed to work with dataframes. The function is_dataframe has a workaround to allow np arrays, mostly to pass the check_estimator tests from sklearn, but I don't think we should extend the functionality beyond dataframes at this point.
- where would this transformer fit in? categorical encoding? creation? It is kind of a hybrid of both...
Finally, would this capture #244 ? or not at all?
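A small illustration of the agg-vs-transform question and of how a merge reuses the mapping learned in fit (all data here is made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"cat": ["a", "a", "b"], "num_var": [1.0, 3.0, 5.0]})

# transform returns one value per ROW, aligned with the original index
df.groupby("cat")["num_var"].transform("mean")  # 2.0, 2.0, 5.0

# agg returns one value per GROUP, keyed by the grouping variable
means = df.groupby("cat")["num_var"].agg("mean").rename("num_var_mean").reset_index()

# the agg result acts as a learned mapping: merging it onto new data reuses
# the values computed in fit, and unseen groups come back as NaN
new = pd.DataFrame({"cat": ["a", "c"]})
new.merge(means, on="cat", how="left")  # a -> 2.0, c -> NaN
```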
Hi @solegalli, these are good points to consider. I summarize my answers in the following points.
1. I think the transformer should not limit the arguments that pandas.DataFrame.groupby already supports (a mapping, function, label, or list of labels), so the user could group by both categorical and numerical variables, or anything else that defines the groups (see the combined example after this list).
2. We could set as defaults mean, max, min, skew, std if the aggregated variable is numeric, or count, nunique, most frequent if the aggregated variable is categorical (illustrated below).
3. Agree on pandas agg, as it allows renaming the new columns (also shown below). That said, pandas transform can accept a dict-like argument to create new features.
4. Yes, the merge in the transform method would add the new column to the input dataframe. Checking the code from is_dataframe:
if isinstance(X, (np.generic, np.ndarray)):
    col_names = [str(i) for i in range(X.shape[1])]
    X = pd.DataFrame(X, columns=col_names)
if an np.ndarray is taken as input, it would not find the same key to make the merge match (the example after the list makes this concrete). Is there some way to specify here to take only a DataFrame as input, but at the same time keep it compatible with sklearn?
5. Can you explain a little bit more what you mean by "it should allow the transformation of multiple variables at the same time"? I imagine that one transformer should take only a single group of features (group by) as input, because each transformer should have a single responsibility.
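To ground points 1-4, a small combined illustration; all data and column names here are placeholders, and the lambda merely stands in for "most frequent":

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "cat": ["a", "a", "b"],
    "num_var": [1.0, 3.0, 5.0],
    "cat_var": ["x", "x", "y"],
})

# point 1: groupby accepts a label, list of labels, mapping, or function
X.groupby("cat")["num_var"].mean()
X.groupby(lambda i: i % 2)["num_var"].mean()  # grouping by a function of the index

# point 2: default functions chosen per dtype, as a dict passed to agg
X.groupby("cat").agg({
    "num_var": ["mean", "max", "min", "skew", "std"],
    "cat_var": ["count", "nunique", lambda s: s.mode().iat[0]],  # mode ~ most frequent
})

# point 3: named aggregation renames the new columns directly
X.groupby("cat").agg(num_var_mean=("num_var", "mean"))

# point 4: an array converted by is_dataframe gets '0', '1', ... as column
# names, so a merge key learned on a dataframe (e.g. 'cat') cannot match
arr = np.array([[1, 2], [3, 4]])
pd.DataFrame(arr, columns=[str(i) for i in range(arr.shape[1])]).columns.tolist()  # ['0', '1']
```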
The underlying idea:
class GroupTransform:
    def __init__(self, grouping_vars=['var1', 'var2'], agg_vars=['var3', 'var4'], ...):
        ...

    def fit(self, X, y=None):
        for var in self.grouping_vars:
            X_merge = X.groupby(var)[self.agg_vars].agg(['mean', 'std'])
Something like that.
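A hedged sketch of that fit loop (function name and variable names are made up), flattening the MultiIndex columns that agg produces so each derived feature gets a flat, traceable name; transform would later merge each lookup table back on its grouping variable:

```python
import pandas as pd

def _fit_group_aggregations(X, grouping_vars, agg_vars, functions=("mean", "std")):
    # one lookup table per grouping variable, learned from the training data
    mappings = {}
    for var in grouping_vars:
        agg = X.groupby(var)[agg_vars].agg(list(functions))
        # flatten ('var3', 'mean') -> 'var3_mean_by_var1'
        agg.columns = [f"{col}_{fn}_by_{var}" for col, fn in agg.columns]
        mappings[var] = agg.reset_index()
    return mappings
```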
Also, is the best place for this transformer the encoding module? And would the transformer drop the original variables? Maybe offer that as a parameter. And finally, in the aggregation, would the transformer allow aggregating by the target? Or should aggregating the target be a different transformer?
@solegalli it is a good idea to let the user decide whether or not to drop the original agg_vars. As the user has control over what to aggregate, that is, which variables to pass to the df.agg method, the user is free to pass a target variable. However, this would duplicate the functionality of the MeanEncoder in feature-engine; that's up to the user.
We can also add a parameter to handle unknown categories, to either raise an error or ignore them. Summarizing the ideas we have so far:
class GroupTransform:
    def __init__(self, grouping_vars=['var1', 'var2'], agg_vars=['var3', 'var4'],
                 drop_agg_vars: bool = True, handle_unknown: str = 'error'):
        ...

    def fit(self, X, y=None):
        # Similar to sklearn https://github.com/scikit-learn/scikit-learn/blob/2beed5584/sklearn/preprocessing/_encoders.py#L168
        if self.handle_unknown == 'error':
            diff = _check_unknown(Xi, cats)
            if diff:
                msg = ("Found unknown categories {0} in column {1}"
                       " during fit".format(diff, i))
                raise ValueError(msg)
        for var in self.grouping_vars:
            X_merge = X.groupby(var)[self.agg_vars].agg(['mean', 'std'])
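Since _check_unknown is a private scikit-learn helper, a self-contained version of that check could be a plain set difference against the groups seen in fit (a sketch; the function name and signature are made up):

```python
def _check_unknown_groups(X, var, seen_groups, handle_unknown="error"):
    # groups present in the data but absent from the mapping learned in fit
    diff = set(X[var].unique()) - set(seen_groups)
    if diff and handle_unknown == "error":
        raise ValueError(
            f"Found unknown categories {sorted(diff)} in column {var}"
        )
    # with handle_unknown='ignore', unseen groups simply become NaN after the merge
    return diff
```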
Happy to move forward with this issue.
Let's leave the target out of the first version of the transformer. The target, in principle, should not be in the train set (or so Feature-engine assumes), so to allow the target, the user would need to pass X and y to fit and we would need to merge them, as we do in the weight of evidence transformer, for example.
I am thinking that the default functionality should be to group only by categorical variables (the aggregation can be over both categorical and numerical variables). All encoders in Feature-engine have a parameter ignore_format, which allows the user to extend the transformer's functionality to numerical variables if they so wish.
> if an np.ndarray is taken as input, it would not find the same key to make the merge match. Is there some way to specify here to take only a DataFrame as input, but at the same time keep it compatible with sklearn?
There is a workaround in is_dataframe that assigns the column index as a string name. So it should work.
I am thinking that this one may also be a good candidate for a new module called "embedding". Thoughts?