
[BUG] GroupedTransformer can't deal with transformers that use `y`

Open sergiocalde94 opened this issue 3 years ago • 5 comments

Hi! First of all thanks for your package, it's awesome! (Same as calmcode and other resources you built).

I'm trying to perform a LeaveOneOutEncoder transformation for each group of my dataset, and I don't think it's possible (or I can't figure out how). The problem is that inside sklego/meta/grouped_transformer.py, fit is called without y, which throws an error: TypeError: fit() missing 1 required positional argument: 'y'.

Here is a dummy example that reproduces what I'm trying to do in my code:

import pandas as pd
import numpy as np

from sklego.datasets import load_heroes
from sklego.meta import GroupedTransformer
from category_encoders import LeaveOneOutEncoder


df_heroes = load_heroes(as_frame=True).replace([np.inf, -np.inf], np.nan).dropna()

# Dummy example
X = df_heroes[['attack_type', 'health', 'attack_spd']]
y = df_heroes['attack']

GroupedTransformer(LeaveOneOutEncoder(), groups=['attack_type']).fit_transform(X, y)

Thanks in advance!

sergiocalde94 avatar Apr 20 '21 23:04 sergiocalde94

For now I'm using this as an alternative solution:

import pandas as pd
import numpy as np

from sklego.datasets import load_heroes
from category_encoders import LeaveOneOutEncoder


df_heroes = load_heroes(as_frame=True).replace([np.inf, -np.inf], np.nan).dropna()

# Dummy example
X = df_heroes[['attack_type', 'health', 'attack_spd']]
y = df_heroes['attack']

attack_types = X['attack_type'].unique()

encoders_fit = {
    attack_type: LeaveOneOutEncoder().fit(
        X[X['attack_type'] == attack_type].drop(columns=['attack_type']),
        y[X['attack_type'] == attack_type]
    )
    for attack_type in attack_types
}

X_transformed = pd.concat(
    [
        encoders_fit[attack_type].transform(
            X[X['attack_type'] == attack_type].drop(columns=['attack_type']),
            y[X['attack_type'] == attack_type]
        )
        for attack_type in attack_types
    ]
)

sergiocalde94 avatar Apr 21 '21 00:04 sergiocalde94

I'm very happy to hear that you enjoy the scikit-lego project as well as calmcode, but I would like to emphasize that scikit-lego isn't my package. It's a collaborative effort from a lot of folks; in particular, @MBrouns is a co-creator/maintainer.

Looking at our grouped transformer, it indeed seems like we assume that any transformer passed in uses only X, never y. I suppose we didn't consider transformers that use the label.
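A toy stand-in (hypothetical class, mimicking the fit signature of a target-aware encoder such as LeaveOneOutEncoder) reproduces the failure mode:

```python
import numpy as np

class ToyTargetEncoder:
    # Hypothetical stand-in for a target-aware encoder like LeaveOneOutEncoder:
    # fit() *requires* y as a positional argument.
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def transform(self, X):
        # Dummy transform: every row becomes the learned target mean.
        return np.full((len(X), 1), self.mean_)

enc = ToyTargetEncoder()
try:
    enc.fit([[1], [2]])  # fitting without y, as the fallback model does
except TypeError as err:
    print(err)  # ... missing 1 required positional argument: 'y'
```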

@MBrouns, it's a slight anti-pattern, but I think I'm not opposed to supporting these sorts of encoders. Any comments on your end?

koaning avatar Apr 21 '21 04:04 koaning

It looks like the problem appears here: https://github.com/koaning/scikit-lego/blob/main/sklego/meta/grouped_transformer.py#L90

For all the estimators fitted on the subgroups we do pass y, as far as I can tell, but the global fallback model is fitted without it. The bit you linked, @koaning, is in the transform step; although LeaveOneOutEncoder does use y during transform as well, I would expect a different error in that case.

I'm fine with making the changes to both parts, but I would like to test whether passing along y will actually work for the transform step. I remember from building TrainOnlyTransformerMixin that it wasn't very clear when y would get passed along to the transform step when running from a pipeline and when it wouldn't.
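The pipeline caveat is easy to demonstrate: scikit-learn's Pipeline.transform only takes X, so a y argument never reaches the steps. A minimal sketch (hypothetical transformer that records whether y arrived):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class YAwareTransformer(BaseEstimator, TransformerMixin):
    # Hypothetical transformer that records whether y reached transform().
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        self.last_transform_got_y_ = y is not None
        return np.asarray(X)

X, y = [[1.0], [2.0]], [0, 1]

# Called directly, y can be forwarded to transform:
t = YAwareTransformer().fit(X, y)
t.transform(X, y)
print(t.last_transform_got_y_)  # True

# Inside a Pipeline, transform() only ever receives X:
pipe = Pipeline([("t", YAwareTransformer())]).fit(X, y)
pipe.transform(X)
print(pipe.named_steps["t"].last_transform_got_y_)  # False
```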

MBrouns avatar Apr 21 '21 06:04 MBrouns

Do you need any help with this issue? Or is there something I missed on my part?

sergiocalde94 avatar Jun 10 '21 11:06 sergiocalde94

You're free to pick it up if you'd like. We're still keeping an eye on the repository, but ever since the pandemic hit we've been less active in picking up the issues.

koaning avatar Jun 10 '21 13:06 koaning

I started looking into this. Maybe I am missing something, but by changing

if self.use_global_model:
-      self.fallback_ = clone(self.transformer).fit(X_value)
+      self.fallback_ = clone(self.transformer).fit(X_value, y)

and adding a check_X parameter, the results were as expected. Namely:

import pandas as pd
import numpy as np

from sklego.datasets import load_heroes
from sklego.meta import GroupedTransformer
from category_encoders import LeaveOneOutEncoder

df_heroes = load_heroes(as_frame=True).replace([np.inf, -np.inf], np.nan).dropna()
df_heroes["random"] = np.random.choice([0, 1], size=len(df_heroes))

# Dummy example
X = df_heroes[["random", "attack_type", "health", "attack_spd"]]
y = df_heroes["attack"]

is_melee = df_heroes["attack_type"] == "Melee"
X_melee = LeaveOneOutEncoder(cols=["random"]).fit(X.loc[is_melee], y.loc[is_melee]).transform(X.loc[is_melee])
X_ranged = LeaveOneOutEncoder(cols=["random"]).fit(X.loc[~is_melee], y.loc[~is_melee]).transform(X.loc[~is_melee])

X_naive = pd.concat([X_melee, X_ranged]).sort_index()
X_grouped = GroupedTransformer(LeaveOneOutEncoder(cols=[0]), groups=["attack_type"], check_X=True).fit(X, y).transform(X)

np.allclose(X_naive.drop(columns="attack_type").to_numpy(), X_grouped)

True

Notice that I couldn't pass cols="random" to LeaveOneOutEncoder, because the transformer receives a numpy array (the output of the _split_groups_and_values function), so columns must be addressed by position.
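The one-line fix above passes y unconditionally, which works because scikit-learn transformers conventionally accept fit(X, y=None). A more defensive sketch (hypothetical helper, not part of scikit-lego) would forward y only when the transformer's fit signature accepts it:

```python
import inspect

from sklearn.base import clone
from sklearn.preprocessing import StandardScaler

def fit_with_optional_y(transformer, X, y=None):
    # Hypothetical helper: clone and fit, forwarding y only if the
    # transformer's fit() signature has a `y` parameter.
    fitted = clone(transformer)
    accepts_y = "y" in inspect.signature(fitted.fit).parameters
    if accepts_y and y is not None:
        return fitted.fit(X, y)
    return fitted.fit(X)

scaler = fit_with_optional_y(StandardScaler(), [[1.0], [3.0]], y=[0, 1])
print(scaler.mean_)  # [2.]
```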

Additional "sanity"/interoperability checks I performed are:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklego.meta import GroupedPredictor

transformer = Pipeline([
    ("encoder", LeaveOneOutEncoder(cols=[0])),
    ("scaler", StandardScaler())
])
_ = GroupedTransformer(transformer, groups=["attack_type"], check_X=True).fit(X, y).transform(X)

transformer = Pipeline([
    ("scaler", StandardScaler()),
    ("encoder", LeaveOneOutEncoder(cols=[0]))
])
_ = GroupedTransformer(transformer, groups=["attack_type"], check_X=True).fit(X, y).transform(X)

estimator = Pipeline([
    ("scaler", StandardScaler()),
    ("encoder", LeaveOneOutEncoder(cols=[0])),
    ("model", LinearRegression())
])
preds = GroupedPredictor(estimator, groups=["attack_type"], check_X=True).fit(X, y).predict(X)

FBruzzesi avatar Mar 04 '24 17:03 FBruzzesi

That seems like a straightforward fix. Maybe add this example as a unit test just to play it safe?

koaning avatar Mar 05 '24 08:03 koaning