scikit-lego
[BUG] GroupedTransformer can't deal with transformers that use `y`
Hi! First of all thanks for your package, it's awesome! (Same as calmcode and other resources you built).
I'm trying to apply a `LeaveOneOutEncoder` transformation to each group of my dataset, and I don't think it's possible (or I can't figure out how): within `sklego/meta/grouped_transformer.py` the `fit` is called without `y`, which raises `TypeError: fit() missing 1 required positional argument: 'y'`.
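The failure mode can be reproduced without sklego at all. A minimal sketch with hypothetical stand-in classes (`NeedsY` mimics an encoder whose `fit` requires `y`; `Wrapper` mimics a meta-transformer that drops it):

```python
# Hypothetical stand-ins: NeedsY mimics an encoder whose fit() requires y,
# Wrapper mimics a meta-transformer that calls fit() without passing y along.
class NeedsY:
    def fit(self, X, y):  # y is a required positional argument
        return self

class Wrapper:
    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        self.transformer.fit(X)  # y is dropped here, which triggers the error
        return self

try:
    Wrapper(NeedsY()).fit([[1], [2]], [0, 1])
except TypeError as exc:
    print(exc)  # fit() missing 1 required positional argument: 'y'
```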
Here is a dummy example that reproduces what I'm trying to do in my code:
```python
import pandas as pd
import numpy as np
from sklego.datasets import load_heroes
from sklego.meta import GroupedTransformer
from category_encoders import LeaveOneOutEncoder

df_heroes = load_heroes(as_frame=True).replace([np.inf, -np.inf], np.nan).dropna()

# Dummy example
X = df_heroes[['attack_type', 'health', 'attack_spd']]
y = df_heroes['attack']

GroupedTransformer(LeaveOneOutEncoder(), groups=['attack_type']).fit_transform(X, y)
```
Thanks in advance!
For now I'm using this as an alternative solution:

```python
import pandas as pd
import numpy as np
from sklego.datasets import load_heroes
from category_encoders import LeaveOneOutEncoder

df_heroes = load_heroes(as_frame=True).replace([np.inf, -np.inf], np.nan).dropna()

# Dummy example
X = df_heroes[['attack_type', 'health', 'attack_spd']]
y = df_heroes['attack']

attack_types = X['attack_type'].unique()
encoders_fit = {
    attack_type: LeaveOneOutEncoder().fit(
        X[X['attack_type'] == attack_type].drop(columns=['attack_type']),
        y[X['attack_type'] == attack_type]
    )
    for attack_type in attack_types
}
X_transformed = pd.concat(
    [
        encoders_fit[attack_type].transform(
            X[X['attack_type'] == attack_type].drop(columns=['attack_type']),
            y[X['attack_type'] == attack_type]
        )
        for attack_type in attack_types
    ]
)
```
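The same per-group fit/transform pattern can be factored into a small helper. A minimal sketch using a toy target-mean encoder (`MeanEncoder` and `grouped_fit_transform` are hypothetical names, standing in for `LeaveOneOutEncoder` and the wrapper logic above):

```python
import pandas as pd

class MeanEncoder:
    """Toy target-mean encoder, standing in for LeaveOneOutEncoder."""
    def fit(self, X, y):
        # Per column, map each category to the mean target value.
        self.means_ = {c: y.groupby(X[c]).mean() for c in X.columns}
        return self

    def transform(self, X):
        return pd.DataFrame(
            {c: X[c].map(self.means_[c]) for c in X.columns}, index=X.index
        )

def grouped_fit_transform(make_transformer, X, y, group_col):
    """Fit a fresh transformer per group and stitch the results back together."""
    parts = []
    for _, idx in X.groupby(group_col).groups.items():
        X_g = X.loc[idx].drop(columns=[group_col])
        parts.append(make_transformer().fit(X_g, y.loc[idx]).transform(X_g))
    return pd.concat(parts).sort_index()

X = pd.DataFrame({"grp": ["a", "a", "b", "b"], "cat": ["x", "y", "x", "x"]})
y = pd.Series([1.0, 2.0, 3.0, 5.0])
print(grouped_fit_transform(MeanEncoder, X, y, "grp")["cat"].tolist())  # [1.0, 2.0, 4.0, 4.0]
```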
I'm very happy to hear that you enjoy the scikit-lego project as well as calmcode, but I would like to emphasize that scikit-lego isn't my package. It's a collaborative effort from a lot of folks; in particular, @MBrouns is also a co-creator/maintainer.
Looking at our grouped transformer, it indeed seems like we assume that any encoders we pass in do not use `y`, only `X`. I suppose we didn't consider transformers that use the label.
@MBrouns, it's a slight anti-pattern, but I think I'm not opposed to supporting these sorts of encoders. Any comments on your end?
It looks like the problem appears here: https://github.com/koaning/scikit-lego/blob/main/sklego/meta/grouped_transformer.py#L90
For all the estimators fitted on the subgroups we pass `y` as far as I can tell, but the global fallback model is fitted without it. The bit you linked @koaning is in the transform step, and although the `LeaveOneOutEncoder` does use `y` during transform as well, I would expect a different error.
I'm fine with making the changes to both parts, but I would like to test whether passing along `y` will actually work for the transform step. I remember from building the `TrainOnlyTransformerMixin` that it was not very clear when `y` would get passed along to the transform step when running from a pipeline and when it wouldn't.
Do you need any help with this issue? Or is it something I missed on my part?
You're free to pick it up if you'd like. We're still keeping an eye on the repository, but ever since the pandemic hit we've been less active in picking up the issues.
I started to take a look into this. Maybe I am missing something, but by changing

```diff
 if self.use_global_model:
-    self.fallback_ = clone(self.transformer).fit(X_value)
+    self.fallback_ = clone(self.transformer).fit(X_value, y)
```

and adding a `check_X` parameter, the results were as expected. Namely:
```python
import pandas as pd
import numpy as np
from sklego.datasets import load_heroes
from sklego.meta import GroupedTransformer
from category_encoders import LeaveOneOutEncoder

df_heroes = load_heroes(as_frame=True).replace([np.inf, -np.inf], np.nan).dropna()
df_heroes["random"] = np.random.choice([0, 1], size=len(df_heroes))

# Dummy example
X = df_heroes[["random", "attack_type", "health", "attack_spd"]]
y = df_heroes["attack"]

is_melee = df_heroes["attack_type"] == "Melee"
X_melee = LeaveOneOutEncoder(cols=["random"]).fit(X.loc[is_melee], y.loc[is_melee]).transform(X.loc[is_melee])
X_ranged = LeaveOneOutEncoder(cols=["random"]).fit(X.loc[~is_melee], y.loc[~is_melee]).transform(X.loc[~is_melee])
X_naive = pd.concat([X_melee, X_ranged]).sort_index()

X_grouped = GroupedTransformer(LeaveOneOutEncoder(cols=[0]), groups=["attack_type"], check_X=True).fit(X, y).transform(X)

np.allclose(X_naive.drop(columns="attack_type").to_numpy(), X_grouped)
# True
```
Notice that I couldn't pass `cols="random"` to `LeaveOneOutEncoder`, because the transformer receives a numpy array (the output of the `_split_groups_and_values` function), so the column has to be referenced by position (`cols=[0]`).
Additional "sanity"/interoperability checks I performed are:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklego.meta import GroupedPredictor

transformer = Pipeline([
    ("encoder", LeaveOneOutEncoder(cols=[0])),
    ("scaler", StandardScaler())
])
_ = GroupedTransformer(transformer, groups=["attack_type"], check_X=True).fit(X, y).transform(X)

transformer = Pipeline([
    ("scaler", StandardScaler()),
    ("encoder", LeaveOneOutEncoder(cols=[0]))
])
_ = GroupedTransformer(transformer, groups=["attack_type"], check_X=True).fit(X, y).transform(X)

estimator = Pipeline([
    ("scaler", StandardScaler()),
    ("encoder", LeaveOneOutEncoder(cols=[0])),
    ("model", LinearRegression())
])
preds = GroupedPredictor(estimator, groups=["attack_type"], check_X=True).fit(X, y).predict(X)
```
That seems like a straightforward fix. Maybe add this example as a unit test just to play it safe?