feature_engine
feat: Group Transformer
Is your feature request related to a problem? Please describe. Aggregating variables by a single category or by multiple categories is a simple task:
df.groupby('cat')["num_var"].transform("mean")
However, to make it compatible with a sklearn Pipeline, a Transformer that meets some requirements is needed. Feature-engine could provide a transformer with this functionality.
Describe the solution you'd like
A GroupTransform class with a fit() method that computes and stores the aggregated information, and a transform() method that merges this information onto the input dataframe.
class GroupTransform:
    def fit(self, X, y=None):
        # compute and store the aggregated information
        self.X_agg_ = X.groupby('cat')["num_var"].transform("mean")
        return self

    def transform(self, X, y=None):
        # merge the stored information onto the input dataframe
        return X.merge(self.X_agg_)
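For illustration, here is a minimal runnable sketch of that idea, assuming the hard-coded group key 'cat' and variable 'num_var' from the snippet above (both placeholder names), and using groupby().mean() rather than transform so the result can be merged back on the group key:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class GroupTransform(BaseEstimator, TransformerMixin):
    """Sketch: learn per-group means in fit, attach them in transform."""

    def fit(self, X, y=None):
        # one row per group, keyed by 'cat', so it can be merged later
        self.X_agg_ = (
            X.groupby("cat")["num_var"].mean().rename("num_var_mean").reset_index()
        )
        return self

    def transform(self, X):
        # left merge keeps every input row; unseen groups get NaN
        return X.merge(self.X_agg_, on="cat", how="left")

df = pd.DataFrame({"cat": ["a", "a", "b"], "num_var": [1.0, 3.0, 5.0]})
GroupTransform().fit_transform(df)  # adds num_var_mean: 2.0, 2.0, 5.0
```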
Describe alternatives you've considered
To make it 100% compatible with sklearn, the fit and transform methods should take an ndarray as input. To pass the information to a new dataframe, the merge method is needed.
Hey @TremaMiguel thanks for filing this issue.
A few things to consider:
- will this transformer allow grouping by any variable, or only by categorical variables? pros and cons? This refers to the variable 'cat' in the code snippet.
- the variables that will be used to create the aggregations ('num_var'): will they be only numerical, or also categorical? Functions like 'count' could be used over categorical variables.
- which functions is this transformer going to consider to derive features? mean, max... can we provide a full list? That may help decide my previous questions.
- also, should we use pandas agg instead of transform? (see the snippet after this list)
- very important: the transformer should learn the mappings in fit, and then use the same mappings in transform. I guess pandas merge will take care of this? (also illustrated below)
- it should allow the transformation of multiple variables at the same time... we need to think about how to enter the variables to group by and the variables to derive the calculations from.
- Feature-engine is designed to work with dataframes. The function is_dataframe has a workaround to allow np arrays, mostly to pass the check_estimator tests from sklearn, but I don't think we should extend the functionality beyond dataframes at this point.
- where would this transformer fit in? categorical encoding? creation? It is kind of a hybrid of both...
Finally, would this capture #244 ? or not at all?
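A small illustration of the agg-vs-transform question and of how a merge reuses the mapping learned in fit (all data here is made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"cat": ["a", "a", "b"], "num_var": [1.0, 3.0, 5.0]})

# transform returns one value per ROW, aligned with the original index
df.groupby("cat")["num_var"].transform("mean")  # 2.0, 2.0, 5.0

# agg returns one value per GROUP, keyed by the grouping variable
means = df.groupby("cat")["num_var"].agg("mean").rename("num_var_mean").reset_index()

# the agg result acts as a learned mapping: merging it onto new data reuses
# the values computed in fit, and unseen groups come back as NaN
new = pd.DataFrame({"cat": ["a", "c"]})
new.merge(means, on="cat", how="left")  # a -> 2.0, c -> NaN
```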
Hi @solegalli, these are good points to consider. I summarize my answers in the following points.
1. I think the transformer should not limit the arguments that pandas.DataFrame.groupby already supports (a mapping, function, label, or list of labels), so the user could group by both categorical and numerical variables, or anything else that defines the groups (see the combined example after this list).
2. We could set as defaults mean, max, min, skew, std if the aggregated variable is numeric, or count, nunique, most frequent if the aggregated variable is categorical (illustrated below).
3. Agree on pandas agg, as it allows renaming the new columns (also shown below). That said, pandas transform can accept a dict-like argument to create new features.
4. Yes, the merge in the transform method would add the new column to the input dataframe. Checking the code from is_dataframe:
if isinstance(X, (np.generic, np.ndarray)):
    col_names = [str(i) for i in range(X.shape[1])]
    X = pd.DataFrame(X, columns=col_names)
if an np.ndarray is taken as input, it would not find the same key to make the merge match (the example after the list makes this concrete). Is there some way to specify here to take only a DataFrame as input, but at the same time keep it compatible with sklearn?
5. Can you explain a little bit more what you mean by "it should allow the transformation of multiple variables at the same time"? I imagine that one transformer should take only a single group of features (group by) as input, because each transformer should have a single responsibility.
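To ground points 1-4, a small combined illustration; all data and column names here are placeholders, and the lambda merely stands in for "most frequent":

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "cat": ["a", "a", "b"],
    "num_var": [1.0, 3.0, 5.0],
    "cat_var": ["x", "x", "y"],
})

# point 1: groupby accepts a label, list of labels, mapping, or function
X.groupby("cat")["num_var"].mean()
X.groupby(lambda i: i % 2)["num_var"].mean()  # grouping by a function of the index

# point 2: default functions chosen per dtype, as a dict passed to agg
X.groupby("cat").agg({
    "num_var": ["mean", "max", "min", "skew", "std"],
    "cat_var": ["count", "nunique", lambda s: s.mode().iat[0]],  # mode ~ most frequent
})

# point 3: named aggregation renames the new columns directly
X.groupby("cat").agg(num_var_mean=("num_var", "mean"))

# point 4: an array converted by is_dataframe gets '0', '1', ... as column
# names, so a merge key learned on a dataframe (e.g. 'cat') cannot match
arr = np.array([[1, 2], [3, 4]])
pd.DataFrame(arr, columns=[str(i) for i in range(arr.shape[1])]).columns.tolist()  # ['0', '1']
```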
The underlying idea:
class GroupTransform:
    def __init__(self, grouping_vars=['var1', 'var2'], agg_vars=['var3', 'var4'], ...):
        ...

    def fit(self, X, y=None):
        for var in self.grouping_vars:
            X_merge = X.groupby(var)[self.agg_vars].agg(['mean', 'std'])
Something like that.
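A hedged sketch of that fit loop (function name and variable names are made up), flattening the MultiIndex columns that agg produces so each derived feature gets a flat, traceable name; transform would later merge each lookup table back on its grouping variable:

```python
import pandas as pd

def _fit_group_aggregations(X, grouping_vars, agg_vars, functions=("mean", "std")):
    # one lookup table per grouping variable, learned from the training data
    mappings = {}
    for var in grouping_vars:
        agg = X.groupby(var)[agg_vars].agg(list(functions))
        # flatten ('var3', 'mean') -> 'var3_mean_by_var1'
        agg.columns = [f"{col}_{fn}_by_{var}" for col, fn in agg.columns]
        mappings[var] = agg.reset_index()
    return mappings
```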
Also, is the best place for this transformer the encoding module? And would the transformer drop the original variables? Maybe offer that as a parameter. And finally, in the aggregation, would the transformer allow aggregating by the target? Or should aggregating the target be a different transformer?
@solegalli it is a good idea to let the user decide whether or not to drop the original agg_vars. As the user has control over what to aggregate, that is, which variables to pass to the df.agg method, the user is free to pass a target variable. However, this would duplicate the functionality of the MeanEncoder in feature-engine; that's up to the user.
We can also add a parameter to handle unknown categories, to either raise an error or ignore them. Summarizing the ideas we have so far:
class GroupTransform:
    def __init__(self, grouping_vars=['var1', 'var2'], agg_vars=['var3', 'var4'],
                 drop_agg_vars: bool = True, handle_unknown: str = 'error'):
        ...

    def fit(self, X, y=None):
        # Similar to sklearn https://github.com/scikit-learn/scikit-learn/blob/2beed5584/sklearn/preprocessing/_encoders.py#L168
        if self.handle_unknown == 'error':
            diff = _check_unknown(Xi, cats)
            if diff:
                msg = ("Found unknown categories {0} in column {1}"
                       " during fit".format(diff, i))
                raise ValueError(msg)
        for var in self.grouping_vars:
            X_merge = X.groupby(var)[self.agg_vars].agg(['mean', 'std'])
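Since _check_unknown is a private scikit-learn helper, a self-contained version of that check could be a plain set difference against the groups seen in fit (a sketch; the function name and signature are made up):

```python
def _check_unknown_groups(X, var, seen_groups, handle_unknown="error"):
    # groups present in the data but absent from the mapping learned in fit
    diff = set(X[var].unique()) - set(seen_groups)
    if diff and handle_unknown == "error":
        raise ValueError(
            f"Found unknown categories {sorted(diff)} in column {var}"
        )
    # with handle_unknown='ignore', unseen groups simply become NaN after the merge
    return diff
```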
Happy to move forward with this issue.
Let's leave the target out of the first version of the transformer. The target, in principle, should not be in the train set (or so Feature-engine assumes), so to allow the target, the user would need to pass X and y to fit and we would need to merge them, as we do in the weight of evidence transformer, for example.
I am thinking that the default functionality should be to group only by categorical variables (the aggregation can be over both categorical and numerical variables). All encoders in Feature-engine have a parameter ignore_format, which allows the user to extend the transformer's functionality to numerical variables if they so wish.
> if an np.ndarray is taken as input, it would not find the same key to make the merge match. Is there some way to specify here to take only a DataFrame as input, but at the same time keep it compatible with sklearn?
There is a workaround in is_dataframe that assigns the column index as a string name. So it should work.
I am thinking that this one may also be a good candidate for a new module called "embedding". Thoughts?