Expose `OnEachColumn` and `OnSubFrame`
This Stack Overflow issue sparked a discussion among the skrub developers about the easiest solution using skrub.
In addition to the skrub expressions, one solution that came to mind is to use `skrub._on_each_column.OnEachColumn`:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
import skrub # your friend here
from skrub import selectors as s
from skrub._on_each_column import OnEachColumn
X = np.hstack((
    np.random.random((1000, 2)),
    np.random.randint(2, size=(1000, 2)),
))
X = pd.DataFrame(X, columns=list("abcd"))
y = np.random.random(1000)
# Create a column selector with arbitrary logic.
not_binary = s.filter(lambda col: col.nunique() > 2)
preprocessor = OnEachColumn(StandardScaler(), cols=not_binary)
print(preprocessor.fit_transform(X))
# Bring it into a sklearn pipeline
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
pipe = make_pipeline(preprocessor, Ridge()).fit(X, y)
- Should we publicly expose `OnEachColumn` and `OnSubFrame`? Their main use case would be to apply a transformer using selectors.
- Should we rename these classes to highlight how they relate to `skb.apply`? `Apply`, `ApplyCols`?
As you know, I'm in favor of exposing it, with the name "ApplyOnCols".
skrub meeting suggestion: ApplyToCols or ApplyToColumns
We can now answer the Stack Overflow post :)